Music Recommender System — Part 2

Naga Sanka
Published in Analytics Vidhya · 7 min read · Dec 8, 2021


Get the music dataset and perform Exploratory Data Analysis

Recap

In the previous article, we created the development environment with all the necessary Python libraries.

In this article, let’s get the dataset we will use, which is provided as part of the Spotify Million Playlist Dataset (MPD) Challenge. To prepare the dataset for machine learning models, we need to perform some data cleaning and data manipulation tasks. We will also explore the dataset to understand its features and combine it with additional data fields obtained via the Spotify API.

Spotify MPD

Getting access to Spotify API

Before getting and exploring the playlist dataset, we first need client credentials for the Spotify API. If you don’t have a Spotify account, create one (free or paid). Once you have an account, head over to Spotify for Developers, click on your Dashboard and accept the terms. Next, click on ‘Create an App’ and give it a name and description. Once the app is created, the overview page opens and you should see a “Client ID” and “Client Secret” on the left-hand side; we will use these to get additional data fields. The “spotipy” Python library will be used to connect to the Spotify API, and it supports two authorization methods.

Client ID and Secret for Spotify APP
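For reference, here is a minimal sketch of the two authorization methods spotipy supports; the placeholder credentials, redirect URI and scope below are examples to replace with your own values.

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials, SpotifyOAuth

# 1) Client Credentials Flow: no user login; enough for reading catalog
#    data such as audio features.
sp = spotipy.Spotify(
    auth_manager=SpotifyClientCredentials(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
    )
)

# 2) Authorization Code Flow: opens a browser login; required for
#    user-specific data such as a user's own playlists.
sp_user = spotipy.Spotify(
    auth_manager=SpotifyOAuth(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        redirect_uri="http://localhost:8080",
        scope="playlist-read-private",
    )
)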

Getting the Dataset

We will use the dataset provided by Spotify to enable research in music recommendations. It includes public playlists created by US Spotify users between January 2010 and November 2017: 1 million playlists, over 2 million unique tracks and nearly 300,000 artists, and it is available here. First, sign up as a member in order to access the dataset files. After you create an account and log in, copy the link for “spotify_million_playlist_dataset.zip”, then run the command below in a Terminal in the workspace created in Part 1, replacing “dataset_url”. The zip file is over 5 GB, so it takes some time to download.

# Download zip file
wget "dataset_url" -O spotify_million_playlist_dataset.zip
# To see all files list in zip file
unzip -l spotify_million_playlist_dataset.zip
# Extract README.md file from zip file
unzip -p spotify_million_playlist_dataset.zip README.md > README.md

As the GitPod instance has only ~30GB of available disk space, we will extract only the required folders/files. The README file extracted from the zip with the command above gives more details on how the data is stored in files and on the individual metadata fields. The zip file also contains a few Python files in the “src” folder, of which “stats.py” computes a number of statistics for the dataset. We can extract the “src” folder from the zip file as below and look at the stats.py file.

# Extract src folder from zip file
unzip spotify_million_playlist_dataset.zip "src/*" -d .

Quick approach

The dataset zip file has 1000 JSON files in the data folder, and each JSON file contains 1000 playlists, each with a number of tracks. We can follow an approach similar to the one in stats.py, which loops over each file and then over each playlist to collect the information. Since we wanted to move quickly to the modeling part, we decided to start with partial data: we extracted 20 JSON files, read the information from each file and added it to a list as shown below.

import os
import json

def loop_slices(path, num_slices=20):
    cnt = 0
    mpd_playlists = []
    filenames = os.listdir(path)
    for fname in sorted(filenames):
        print(fname)
        if fname.startswith("mpd.slice.") and fname.endswith(".json"):
            cnt += 1
            fullpath = os.sep.join((path, fname))
            f = open(fullpath)
            js = f.read()
            f.close()
            current_slice = json.loads(js)
            # Create a list of all playlists
            for playlist in current_slice['playlists']:
                mpd_playlists.append(playlist)
            if cnt == num_slices:
                break
    return mpd_playlists

# Path where the json files are extracted
path = 'data/'
playlists = loop_slices(path, num_slices=20)

The next step is to extract the audio features for each track in each playlist. We used the spotipy library to request the required audio features from Spotify. Finally, we took the average of the audio features of all tracks in each playlist for our modeling purpose.

import os
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from tqdm import tqdm

# Spotify credentials
os.environ["SPOTIPY_CLIENT_ID"] = "Replace with Client ID"
os.environ["SPOTIPY_CLIENT_SECRET"] = "Replace with Client Secret"
os.environ['SPOTIPY_REDIRECT_URI'] = "http://localhost:8080"
sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())

cols_to_keep = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']

dfs = []
for playlist in tqdm(playlists):
    audio_feats = []
    for track in playlist['tracks']:
        track_uri = track['track_uri'].split(":")[2]
        feature = sp.audio_features(track_uri)
        if feature:
            audio_feats.append(feature[0])
    # Average the audio features of all tracks in the playlist
    avg_feats = pd.DataFrame(audio_feats)[cols_to_keep].mean()
    avg_feats['name'] = playlist['name']
    avg_feats['pid'] = playlist['pid']
    dfs.append(avg_feats.T)

After extracting all the average audio features for the 20,000 playlists, we proceeded to build and train machine learning models.
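For reference, here is a minimal sketch of how the per-playlist rows collected in dfs above can be combined into a single feature table and saved for the modeling step; the output file name is just an example.

import pandas as pd

# Each entry in dfs is a Series of averaged audio features for one playlist;
# stacking them gives one row per playlist
playlist_feats_df = pd.DataFrame(dfs).reset_index(drop=True)
playlist_feats_df.to_csv('data/playlist_avg_audio_features.csv', index=False)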

Process One Million Playlists

We got decent recommendations using 20,000 playlists and a clustering model, as described in another article. But when we tested with different user playlists, it failed for some users. So we wanted to retrain the models using the complete dataset, which requires the average audio features for all the playlists. The quick approach described above either didn’t work or took too much time to read and extract the features, so we updated the data processing code. While doing so, we found that we don’t need to extract features for every track, because many tracks repeat across playlists. The plot below shows how the new, existing and total track counts change over the first few JSON files. Even though each JSON file has more than 60K tracks, there are only 30K-35K unique tracks, and after the first few files fewer than 5K of them are new.

Change of tracks for each json file
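These counts can be reproduced with a running set of track URIs. Below is a small illustrative sketch (not the exact code from the repository), assuming the slice files are extracted under data/.

import os
import json

# Count new vs. already-seen unique tracks contributed by each slice file
seen_uris = set()
path = 'data/'
for fname in sorted(os.listdir(path)):
    if not (fname.startswith("mpd.slice.") and fname.endswith(".json")):
        continue
    with open(os.path.join(path, fname)) as f:
        current_slice = json.load(f)
    slice_uris = {track['track_uri']
                  for playlist in current_slice['playlists']
                  for track in playlist['tracks']}
    new_uris = slice_uris - seen_uris
    print(f"{fname}: total {len(slice_uris)}, "
          f"existing {len(slice_uris) - len(new_uris)}, new {len(new_uris)}")
    seen_uris |= new_uris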

A few process improvements were applied to the data processing code. The first is that, instead of extracting all the JSON files, we used the Python “zipfile” library to read them directly from the zip file.

import os
import json
from zipfile import ZipFile
import fnmatch

def extract_mpd_dataset(zip_file, num_files=0, num_playlists=0):
    with ZipFile(zip_file) as zipfiles:
        file_list = zipfiles.namelist()
        # Get only the json files, sorted by slice number
        json_files = fnmatch.filter(file_list, "*.json")
        json_files = [f for i, f in sorted([(int(filename.split('.')[2].split('-')[0]), filename) for filename in json_files])]
        cnt = 0
        for filename in json_files:
            cnt += 1
            print('\nFile: ' + filename)
            with zipfiles.open(filename) as json_file:
                json_data = json.loads(json_file.read())
            process_json_data(json_data, num_playlists)
            if (cnt == num_files) and (num_files > 0):
                break

zip_file = 'data/spotify_million_playlist_dataset.zip'
extract_mpd_dataset(zip_file, 0, 0)

The second improvement is to read the complete file at once instead of reading one playlist at a time. We used the Pandas “json_normalize” function to convert the JSON file into a DataFrame, which made it easy to process the playlists and tracks. The sample code is below; you can see the full code on GitHub.

import pandas as pd

def process_json_data(json_data, num_playlists):
    # Get all playlists in the file
    playlists_df = pd.json_normalize(json_data['playlists'])
    # Get all the tracks in the file
    tracks_df = pd.json_normalize(json_data['playlists'], record_path=['tracks'], meta=['pid', 'num_followers'])

We used the built-in SQL functionality of Pandas with SQLite to read and save the DataFrames as tables. Four tables were created: playlists, tracks, features and ratings. The playlists, tracks and features tables hold unique items, and the ratings table links playlists to tracks. To create the unique tracks table, we removed duplicate tracks not only within the current file but also those that already exist in the database. Here is the sample code to save the tables in the database.

import sqlite3
db_file = 'data/spotify_million_playlists.db'
conn = sqlite3.connect(db_file)
# Create playlists table
playlists_df.to_sql(name='playlists', con=conn, if_exists='append', index=False)
# Added unique track_id for each non-duplicate tracks
# Create ratings table in database
ratings_df = tracks_df[['pid', 'track_id', 'pos', 'num_followers']]
# Remove all duplicate tracks, Create tracks table
tracks_df.drop(['pos', 'duration_ms', 'pid', 'num_followers'], axis=1, inplace=True)
tracks_df = tracks_df.drop_duplicates(subset='track_uri', keep="first")
tracks_df.to_sql(name='tracks', con=conn, if_exists='append', index=False)
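The step of removing tracks that already exist in the database is only hinted at above. Here is a minimal sketch of that step, simplified from the full code on GitHub and assuming the tracks table was already created by an earlier file.

# Drop tracks whose track_uri was already saved by an earlier json file
existing_uris = pd.read_sql('select track_uri from tracks', conn)['track_uri']
tracks_df = tracks_df[~tracks_df['track_uri'].isin(existing_uris)]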

The final modification we made is to get the audio features for multiple tracks at once. The spotipy library can fetch audio features for up to 100 tracks per call, so we updated the code to read all “track_uri” values from the tracks table and process them 100 at a time. Below is the sample code.

cur = conn.cursor()
cur.execute('''select track_id, track_uri from tracks where (track_id > ?) and (track_id <= ?)''', (0, 100))
rows = cur.fetchall()
uris = [row[1] for row in rows]
feats_list = sp.audio_features(uris)
# Remove None items, for some tracks there are no features;
# keep the matching track_id for each track that returned features
track_id_list = [row[0] for row, item in zip(rows, feats_list) if item]
feats_list = [item for item in feats_list if item]
feats_df = pd.DataFrame(feats_list)
columns = ['danceability','energy','key','loudness','mode','speechiness','acousticness','instrumentalness','liveness','valence','tempo','duration_ms','time_signature']
feats_df = feats_df[columns]
feats_df.insert(loc=0, column='track_id', value=track_id_list)
feats_df.to_sql(name='features', con=conn, if_exists='append', index=False)
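The snippet above handles only the first 100 track_ids. As a rough sketch (the batching loop below is illustrative, not the exact code from the repository), the same pattern can be repeated over the whole tracks table.

# Loop over the whole tracks table in batches of 100 track_ids
total_tracks = cur.execute('select count(*) from tracks').fetchone()[0]
for start in range(0, total_tracks, 100):
    cur.execute('''select track_id, track_uri from tracks where (track_id > ?) and (track_id <= ?)''', (start, start + 100))
    rows = cur.fetchall()
    if not rows:
        continue
    feats_list = sp.audio_features([row[1] for row in rows])
    track_ids = [row[0] for row, item in zip(rows, feats_list) if item]
    feats_df = pd.DataFrame([item for item in feats_list if item])[columns]
    feats_df.insert(loc=0, column='track_id', value=track_ids)
    feats_df.to_sql(name='features', con=conn, if_exists='append', index=False)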

The complete process of reading one million playlists and getting the audio features for each unique song from Spotify takes more than two hours of system time. So we added code to save the logs in a text file for easy tracking; you can see the complete log file in my GitHub.
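As an example, progress messages can be mirrored into a text file with Python’s logging module; the log file name below is just an example.

import logging

# Send progress messages to both the console and a log file
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(message)s",
    handlers=[logging.FileHandler("data/processing_log.txt"),
              logging.StreamHandler()],
)
# e.g. inside the file-processing loop:
logging.info("File: %s", filename)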

Next Step

In the next article, we will build and train machine learning models using the data that we collected here.


If you enjoy reading my articles and want to support me, please consider signing up to become a Medium member. It’s $5 a month and gives you unlimited access to stories on Medium. Please signup using my link to support me: https://nsanka.medium.com/membership.
