Introduction

It is a common trope in the music world for artists to need one big break. If musicians have even just one popular song, that can often be enough to change the trajectory of their careers. But this begs the question, what makes a song popular? After all, songs, by their very nature of being music, are generally all unique. It is from this perplexing idea that the research question of this project arises: Can we predict the popularity of a song, given several different characteristics about the song?.

To answer this question, we look to data that we collected using the Spotipy API (see citation below). The data was collected using a Python script that randomly sampled several thoousand songs available on Spotify. Once a song was selected, useful data about the song was collected. This includes both categorical variables, like key, mode, etc. but also quantitative, measured variables like danceability, acousticness, energy and more. The data was compiled into a data frame and written out to a CSV file. The dataset features 3,340 observations (songs) and has 23 variables total originally.

Data Citation

Our data source is the Spotify Web API. Documentation and information on the API can be found at https://developer.spotify.com/documentation/web-api/guides/

The Python script that we used to scrape our data can be found in this repository as data_scrape.ipynb.

Data Wrangling

As discussed above, we wrote our API results into a file called spotify_dat.csv. From here, we had to do some data wrangling.

#load packages
library(tidyverse)
library(tidymodels)
library(lubridate)
library(skimr)
library(patchwork)
library(corrplot)
library(stacks)
spotify_dat <- read_csv("data/unprocessed/spotify_dat.csv") %>%
  janitor::clean_names() %>%
  rename(id = x1) %>%
  mutate(release_date = ymd(release_date)) %>%
  mutate(is_single = ifelse(total_tracks == 1, TRUE, FALSE)) %>%
  mutate(position_in_album =
           ifelse(total_tracks == 1, 0, track_number/total_tracks)) %>%
  select(-name, -album, -artist, -track_number) %>%
  mutate(mode = factor(mode),
         key = factor(key),
         time_signature = factor(time_signature))

#spotify_dat %>%
#  write_csv("data/processed/spotify_dat.csv")

We did this data wranglingto get the data into a more usable format. To summarise our changes:

Exploratory Data Analysis

With the dataset compiled in a useful format, we conduct an exploratory data analysis to get an overview of the data before fitting models.

Missingness and Outcome Variable

naniar::miss_var_summary(spotify_dat)
## # A tibble: 21 x 3
##    variable         n_miss pct_miss
##    <chr>             <int>    <dbl>
##  1 release_date        144     4.31
##  2 id                    0     0   
##  3 length                0     0   
##  4 popularity            0     0   
##  5 danceability          0     0   
##  6 acousticness          0     0   
##  7 energy                0     0   
##  8 instrumentalness      0     0   
##  9 liveness              0     0   
## 10 loudness              0     0   
## # … with 11 more rows
spotify_dat <- spotify_dat %>%
  drop_na()

After parsing, we noted that release date has 4.31% missingness. This is because some values in that field were not actually dates. We proceeded to remove the rows with missing data.

#Popularity has a bimodal distribution 
p1 <- spotify_dat %>% 
  ggplot(aes(popularity)) +
  geom_histogram()

p2 <- spotify_dat %>% 
  ggplot(aes(popularity)) + 
  geom_density()

p1+p2

From the histogram and density curve, it is interesting to see that the outcome variable popularity has a bimodal distribution. The modes occur approximately at 25 and 75, seeming to indicate that most songs are either considered fairly popular (but not extremely popular) and fairly unpopular (but not extremely unpopular).

spotify_dat %>%
  ggplot(aes(popularity)) +
  geom_boxplot()

From this boxplot, we see that popularity doesn’t have extreme skew, and so a log transform is not necessary, but there is slight right skew.

Next, we examined the relationships between variables through plots. Below are some of the highlights, the more interesting relationships between variables that we discovered.

Predictor Variables

spotify_dat %>% 
  ggplot(aes(explicit, y =..prop.., group = 1)) + 
  geom_bar(fill = "dark green", alpha = .9)

As expected, a majority of songs are not explicit. This data point may help provide some good differentiation between songs, as certain genres tend to be more explicit and those genres on average could be more or less popular.

spotify_dat %>% 
  ggplot(aes(danceability)) +
  geom_density()

The distribution of danceability is unimodal and skewed left. It appears a majority of songs are considered relatively danceable.

Response and Predictor Relationships

#Explicit songs on average are more popular and are more danceable
spotify_dat %>% 
  group_by(explicit) %>% 
  summarize(avg_popularity = mean(popularity), avg_danceability = mean(danceability))
## # A tibble: 2 x 3
##   explicit avg_popularity avg_danceability
##   <lgl>             <dbl>            <dbl>
## 1 FALSE              37.8            0.566
## 2 TRUE               52.4            0.716

Explicit songs on average are more popular and are more danceable. This was fairly expected. Popular songs these days typically feature some explicit language, even if the song is considered relatively family friendly.

#Popularity trends upwards with danceability
spotify_dat %>% 
  ggplot(aes(danceability, popularity)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

This graph confirms that popularity trends upward with danceability.

#Popularity trends up as songs get longer but then hits maximum point then trends downward
spotify_dat %>% 
  ggplot(aes(length, popularity)) +
  geom_point() +
  geom_smooth(se = FALSE)

Popularity appears to increase with song length but only up to a point. After a song gets too long, popularity decreases. Tree based methods or a KNN model may be able to incorporate this into their prediction well.

# time signature of 0 corresponds to white noise/nature noise
p3 <- spotify_dat %>% 
  ggplot(aes(time_signature, popularity, group = time_signature)) + 
  geom_boxplot()

p4 <- spotify_dat %>% 
  ggplot(aes(key, popularity, group = key)) + 
  geom_boxplot()

p3+p4

Time signature and key don’t appear to have a strong relationship with popularity. They may not be very useful predictors.

Relationship Between Predictor Variables

Next, we took a look at the correlations between variables.

##Correlation Table-popularity has highest correlation with loudness and most negative correlation with instrumentalness
numeric <- spotify_dat %>% select(danceability,acousticness,energy,instrumentalness,liveness,loudness,speechiness)
corrplot(cor(numeric), method = "number")

Many of the predictor variables have strong relationshps with each other. Acousticness and loudness have a negative correlation of -0.69, which is perhaps to be expected as acoustic songs are not generally going to be loud. Acousticness and energy have a negative correlation of -0.73, which is not suprising for a similar reason as above. Energy and loudness are also highly correlated at 0.81.

Model Fitting

Model Pre-processing

The first steps in this modeling process are setting the random seed, splitting the data, verifying the split, and cross-validation.

# Seed
set.seed(3013)

# Initial split & folding ----
spotify_split <- spotify_dat %>%
  initial_split(prop = 0.75, strata = popularity)

spotify_train <- training(spotify_split)
spotify_test <- testing(spotify_split)

# verifying split
#dim(spotify_dat)
#dim(spotify_train)
#dim(spotify_test)

# Folds
spotify_folds <- vfold_cv(spotify_train, v = 5, repeats = 3, strata = popularity)

The next step is to write the recipe.

The first step in this recipe is to use step_date() to convert the date data into a usable format.

The next step is using step_mutate() to convert the variables ‘explicit’ and ‘is_single’ to numeric.

Next, we use step_rm() to remove some columns. We will remove the following:

  • id: this is a non-predictive column that assigns a unique id to each song
  • release_date: removing the old date column

Next, we employ step_dummy() to one hot encode all categorical predictors.

Next, we use step_interact() to include interaction terms between variables that end in ‘ness’, like liveliness, acousticness, loudness, etc.

Next, we employ step_nvz() to remove any columns that have near-zero variance. These columns are highly sparse and imbalanced.

Lastly, we use step_normalize() to scale and center all predictors.

## Build general recipe (featuring eng.) ----
spotify_recipe <- recipe(popularity ~ . , data = spotify_train) %>%
  # handling dates
  step_date(release_date) %>%
  # converting logical columns to numeric
  step_mutate(explicit = as.numeric(explicit)) %>%
  step_mutate(is_single = as.numeric(is_single)) %>%
  # removing id column and old date column
  step_rm(id, release_date) %>%
  # creating dummy variables
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  # including interaction terms
  step_interact(~ ends_with('ness'):ends_with('ness')) %>%
  # removing variables with near 0 variance
  step_nzv(all_predictors()) %>%
  # normalizing variables
  step_normalize(all_predictors())

These models have run in their respective R scripts. They have been loaded below, but the code present in each script is included below as well.

The models I have chosen to run are:

Random Forest , tuning mtry and min_n

# load required objects ----
load('model_objects/spotify_folds.rda')
load('model_objects/spotify_recipe.rda')

# Define model ----
rf_model <- rand_forest(
  mode = "regression",
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("ranger")

# set-up tuning grid ----
rf_params <- parameters(rf_model) %>%
  update(mtry = mtry(range(c(6, 16))))

# define tuning grid
rf_grid <- grid_regular(rf_params, levels = 5)

# workflow ----
rf_workflow <- workflow() %>%
  add_model(rf_model) %>%
  add_recipe(spotify_recipe)

# Tuning/fitting ----
tic("random forest")

rf_tune <- rf_workflow %>%
  tune_grid(resamples = spotify_folds, grid = rf_grid,
            control = control_stack_grid())

toc(log = TRUE)

# save runtime info
rf_runtime <- tic.log(format = TRUE)

# Write out results & workflow
save(rf_tune, rf_workflow, file = "model_objects/rf_tune.rda")

Boosted tree , tuning mtry, min_n, and learn_rate

# load required objects ----
load('model_objects/spotify_folds.rda')
load('model_objects/spotify_recipe.rda')

# Define model ----
bt_model <- boost_tree(
  mode = "regression",
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("xgboost")

# set-up tuning grid ----
bt_params <- parameters(bt_model) %>%
  update(mtry = mtry(range(c(6, 14))))

# define tuning grid
bt_grid <- grid_regular(bt_params, levels = 5)

# workflow ----
bt_workflow <- workflow() %>%
  add_model(bt_model) %>%
  add_recipe(spotify_recipe)

# Tuning/fitting ----
tic("boosted tree")

bt_tune <- bt_workflow %>%
  tune_grid(resamples = spotify_folds, grid = bt_grid,
            control = control_stack_grid())

toc(log = TRUE)

# save runtime info
bt_runtime <- tic.log(format = TRUE)

# Write out results & workflow
save(bt_tune, bt_workflow, file = "model_objects/bt_tune.rda")

Support Vector Machine, tuning cost and rbf_sigma

# load required objects ----
load("model_objects/spotify_recipe.rda")
load("model_objects/spotify_folds.rda")

# Define model ----
svm_model <- svm_rbf(
  mode = "regression",
  cost = tune(),
  rbf_sigma = tune()
) %>%
  set_engine("kernlab")

# # check tuning parameters
# parameters(svm_model)

# set-up tuning grid ----
svm_params <- parameters(svm_model)

# define grid
svm_grid <- grid_regular(svm_params, levels = 5)

# workflow ----
svm_workflow <- workflow() %>%
  add_model(svm_model) %>%
  add_recipe(spotify_recipe)

# Tuning/fitting ----
svm_res <- svm_workflow %>%
  tune_grid(
    resamples = spotify_folds,
    grid = svm_grid,
    control = control_stack_grid()
  )

# Write out results & workflow
save(svm_res, svm_workflow, file = "model_objects/svm_res.rda")

MARS tuning num_terms and prod_degree

# MARS ----
set.seed(3013)

# Load package(s) ----
library(tidyverse)
library(tidymodels)
library(tictoc)
library(stacks)

# load required objects ----
load('model_objects/spotify_recipe.rda')
load('model_objects/spotify_folds.rda')

# Define model ----
mars_model <- mars(
  mode = "regression",
  num_terms = tune(),
  prod_degree = tune()
) %>%
  set_engine("earth")

# set-up tuning grid ----
mars_params <- parameters(mars_model)  %>%
  update(num_terms = num_terms(range(c(5, 15))),
         prod_degree = prod_degree(range(c(1, 5))))

# define tuning grid
mars_grid <- grid_regular(mars_params, levels = 5)

# workflow ----
mars_workflow <- workflow() %>%
  add_model(mars_model) %>%
  add_recipe(spotify_recipe)

# Tuning/fitting ----
tic("mars")

mars_tune <- mars_workflow %>%
  tune_grid(resamples = spotify_folds,
            grid = mars_grid,
            control = control_stack_grid())

toc(log = TRUE)

# save runtime info
mars_runtime <- tic.log(format = TRUE)

# Write out results & workflow
save(mars_tune, mars_workflow, file = "model_objects/mars_tune.rda")

These four models were chosen because they have demonstrated in previous projects that they work well with a predictive regression problem.

load("model_objects/rf_tune.rda")
load("model_objects/bt_tune.rda")
load("model_objects/svm_res.rda")
load("model_objects/mars_tune.rda")
load("model_objects/model_stack.rda")
load("model_objects/model_stack_trained.rda")

Lastly, we created an ensemble model using the tuned results of the four models above. The code for this is present below.

# Load split data object & get testing data
load("model_objects/spotify_test.rda")

# Create data stack ----
spotify_data_st <- 
  stacks() %>% 
  add_candidates(mars_tune) %>%
  add_candidates(svm_res) %>%
  add_candidates(rf_tune) %>%
  add_candidates(bt_tune)

spotify_data_st

# Fit the stack ----
# penalty values for blending (set penalty argument when blending)
blend_penalty <- c(10^(-6:-1), 0.5, 1, 1.5, 2)

# Blend predictions using penalty defined above (tuning step, set seed)
set.seed(9876)

spotify_model_st <- 
  spotify_data_st %>% 
  blend_predictions(penalty = blend_penalty)

# Save blended model stack for reproducibility & easy reference (Rmd report)
save(spotify_model_st, file = "Model_objects/model_stack.rda")

# Explore the blended model stack
spotify_model_st

autoplot(spotify_model_st)

autoplot(spotify_model_st, type = "weights")

# fit to ensemble to entire training set ----
spotify_model_st <-
  spotify_model_st %>%
  fit_members()

# Save trained ensemble model for reproducibility & easy reference (Rmd report)
save(spotify_model_st, file = "model_objects/model_stack_trained.rda")

Model Selection

Since this is a regression classification problem, it is appropriate to use RMSE as a performance measure.

In the charts below, we see how RMSE varies by tuning parameters for each model.

Random Forest

autoplot(rf_tune, metric = "rmse")

Boosted Tree

autoplot(bt_tune, metric = "rmse")

Support Vector Machine

autoplot(svm_res, metric = "rmse")

Mars Model

autoplot(mars_tune, metric = "rmse")

Ensemble Model

autoplot(spotify_model_st, metric="rmse")

We see that the ensemble model was able to achieve the best RMSE across the training set folds. Thus, we select the ensemble model as the model to use on the testing set.

Model Performance Evaluation

Now that we’ve selected a model, we fit it to the testing data. Below, we can compare acutal popularity to the ensemble’s predictions.

### EXPLORE AND ASSESS ENSEMBLE MODEL
load("model_objects/spotify_test.rda")

spotify_test <- 
  spotify_test  %>% 
  bind_cols(predict(spotify_model_st, .))

spotify_test
## # A tibble: 801 x 22
##       id release_date length popularity danceability acousticness  energy
##    <dbl> <date>        <dbl>      <dbl>        <dbl>        <dbl>   <dbl>
##  1     0 2019-01-16    96053         23        0.383     0.203    0.397  
##  2     5 2020-12-08   189538         40        0.779     0.111    0.6    
##  3     7 2017-09-27   218573         44        0.743     0.261    0.924  
##  4     9 2019-05-10   177226         41        0.416     0.993    0.00661
##  5    15 2019-10-04   189322         82        0.474     0.0687   0.631  
##  6    20 2021-02-26   207692         26        0.543     0.000658 0.782  
##  7    24 2008-01-01   217613         39        0.591     0.00014  0.896  
##  8    25 2017-10-06   185176         77        0.207     0.925    0.243  
##  9    28 2021-03-09   118404         34        0.396     0.275    0.926  
## 10    33 2020-07-10   147547         75        0.661     0.538    0.665  
## # … with 791 more rows, and 15 more variables: instrumentalness <dbl>,
## #   liveness <dbl>, loudness <dbl>, speechiness <dbl>, tempo <dbl>,
## #   time_signature <fct>, key <fct>, mode <fct>, valence <dbl>,
## #   avail_markets <dbl>, total_tracks <dbl>, explicit <lgl>, is_single <lgl>,
## #   position_in_album <dbl>, .pred <dbl>

We can also compare the predictions of different models. We can see that some models tended to underpredict where others overpredicted.

member_preds <- 
  spotify_test %>%
  select(popularity) %>%
  bind_cols(predict(spotify_model_st, spotify_test, members = TRUE))

member_preds
## # A tibble: 801 x 8
##    popularity .pred mars_tune_1_24 mars_tune_1_20 svm_res_1_20 rf_tune_1_05
##         <dbl> <dbl>          <dbl>          <dbl>        <dbl>        <dbl>
##  1         23  35.6           43.4           41.4         20.5         36.8
##  2         40  44.8           44.4           42.1         57.1         41.0
##  3         44  36.8           37.4           41.0         27.9         37.5
##  4         41  41.8           30.5           30.2         40.7         45.2
##  5         82  59.9           59.3           58.8         73.4         56.6
##  6         26  56.4           58.4           54.9         61.7         54.1
##  7         39  57.6           51.9           48.4         72.2         56.6
##  8         77  39.2           30.2           28.9         49.9         43.2
##  9         34  36.1           35.1           37.6         33.9         39.2
## 10         75  72.9           69.2           68.1         69.1         71.1
## # … with 791 more rows, and 2 more variables: bt_tune_1_08 <dbl>,
## #   bt_tune_1_05 <dbl>

We can compare the estimated RMSE of each member of the ensemble, and also compare this with the ensemble itself. Interestingly, the ensemble model performed very slightly worse than one of the random forest models on the test set, although the ensemble outperformed it on the training set.

The ensemble model achieves an RMSE of 18.19903 on the testing set.

map_dfr(member_preds, rmse, truth = popularity, data = member_preds) %>%
  mutate(member = colnames(member_preds)) %>%
  arrange(.estimate)
## # A tibble: 8 x 4
##   .metric .estimator .estimate member        
##   <chr>   <chr>          <dbl> <chr>         
## 1 rmse    standard         0   popularity    
## 2 rmse    standard        18.1 rf_tune_1_05  
## 3 rmse    standard        18.2 .pred         
## 4 rmse    standard        19.4 bt_tune_1_05  
## 5 rmse    standard        19.9 bt_tune_1_08  
## 6 rmse    standard        20.2 mars_tune_1_24
## 7 rmse    standard        21.2 mars_tune_1_20
## 8 rmse    standard        21.7 svm_res_1_20

Based on the RMSE score achieved, we believe our model is fairly useful, although it is not perfect. A song’s popularity can range from 1 to 100, so a root mean squared error of about 18 indicates our model is somewhat useful at predicting a songs popularity. It is certainly better than random choice.

Conclusion and Debrief

From this model, it appears that the characteristics of a song can indeed roughly predict a song’s popularity, with some error. This result is significant and intuitive. As one might expect, certain attributes of a song influence its popularity, like its danciness or length. However, given the error of our model, there seems to be either variables not included in our dataset that influence a song’s popularity, or perhaps more likely, a random element to which songs become popular.

We believe that with additional data resources, this model could be improved further. Firstly, it is likely that the model could be improved with a larger training set. While limitations on time and on our homemade random song query-er stopped us from collecting more data, this would likely improve our model. We hope that the Spotify API adds a feature for querying a random song, as this would increase the ease of collecting such data. Additionally, it is likely that more variables could improve our model. For instance, we think that additional information on the artist, for instance their country of origin or whether they are signed to a record label, might be useful for prediction.

We think that both the random forest model and the ensemble model turned out to be the best because of a few advantages. First, random forests perform well with high dimensionality, since it works with subsets of the data. Since there ended up being about two dozen predictors in this model, ability to handle high dimensionality may have contributed to the random forest’s success. The ensemble model likely performed well because it was able to average out the strengths and weaknesses of its members.

This experience has raised a few new research questions for us. Namely, we have become interested in how predictors of a song’s popularity may vary by country. For instance, we hypothesize that dancibility is more important in some countries, while speechiness may be more important in others. We think that we might be able to answer this question by fitting some sort of linear regression model on each of several countries of interest, then performing a meta-analysis of our results. We think this could be an interesting extention of our project.