Splunking The Billboard Hot 100 with help from the Spotify API

By: Karthik November 01, 2017

There's a lot of data out there and once we put it into Splunk, there's a lot of interesting information we can pull out of it, so why not have a trip down memory lane and see what sort of songs pop up when going through the Billboard Hot 100 charts from now back to 2000?

First, I found a scraper for the data - thank you Allen Guo for your Billboard charts scraper - and output the data in this format:

date | title | artist | weeks | delta | current | peak | previous | spotifyID

I did so with this python script:

import billboard
import json
import datetime


outfilename = 'output.psv'
counter = 1000
chart_type = 'hot-100'
chart = billboard.ChartData(chart_type)
chart_date = '2017-07-01'
first_line = 'date | title | artist | weeks | delta | current | peak | previous | spotifyID\n'


with open(outfilename, 'a') as outputfile:
   outputfile.write(first_line)


for i in range (1,counter+1):
    for position in range (0,99):
        song = chart[position]
        line_out = unicode(str(chart_date) + ' | ' + unicode(song.title) + ' | ' + unicode(song.artist) + ' | ' + str(song.weeks) + ' | ' + str(song.change) + ' | ' + str(position) + ' | ' + str(song.peakPos) + ' | ' + str(song.lastPos) + ' | ' + str(song.spotifyID) + '\n')
        songidout = str(song.spotifyID) + '\n'
        with open(outfilename, 'a') as outputfile:
            outputfile.write(line_out.encode('utf8'))
    print chart.previousDate
    print chart[0]
    chart_date = chart.previousDate
    chart = billboard.ChartData(chart_type, str(chart.previousDate))
print 'done'

It's actually only been about 910 weeks between now and January 1, 2000; but I figured why not just make the dataset a round 1000 and then use the date to filter data?

Next, I created a props.conf with these settings:

[billboard_top_100]
CHARSET=UTF-8
MAX_DAYS_AGO=10000
MAX_DAYS_HENCE=25
NO_BINARY_CHECK=true
SHOULD_LINEMERGE=false
TIME_FORMAT=%Y-%m-%d
category=Custom
description=Billboard Top 100
disabled=false
pulldown_type=true
REPORT-BillboardMainFields=REPORT-BillboardMainFields
EVAL-Position = CurrentPosition+1

Since all of my data is on different lines (I output it that way) we set SHOULD_LINEMERGE to false. Other things we need to set include MAX_DAYS_HENCE because Billboard charts can, weirdly, be up to nearly 3 weeks in the future, and MAX_DAYS_AGO because the default of 2000 days just isn't enough to take us back to party like it's 1999.

(The EVAL statement is in there because, if you notice my python code above, position goes from 0-99, and I made the rookie mistake of outputting position instead of position+1, so I used an eval statement to fix it. If you debug your code and don't run it at 2AM just trying to get the thing running, you probably won't need that sort of statement in there. I also didn't want to reindex data, even though it was only about 12 MB of data.) Public Service Announcement: Always debug your code because computers do what you tell them to do, not what you want them to do...

And a transforms.conf here:

[REPORT-BillboardMainFields]
DELIMS = "|"
FIELDS = "Chart_Date","Title","Artist","WeeksOnChart","ChangeFromPrevious","CurrentPosition","PeakPosition","PreviousPosition","spotifyID"

While I could have used "INDEXED-EXTRACTIONS" with the PSV command, I stuck with keeping the big 5 as my only indexed fields, and did these extractions here.

There are plenty of things that we can do with this data, like answer the following random questions that probably tell you more about the workings of my mind than about music trends:

Who had more Billboard top 100 hits, Miley Cyrus or Hannah Montana? (Surprisingly, each had exactly 18 charted songs.)
How many pop songs from my nascent adolescence had I completely forgotten about until setting out on this task? (Too many to count.)
What artist/group/band has had the most top 100 hits since 2000? (More on this later at this post, because: reasons.)
What's the stats on Beyoncé's songs compared to Destiny's Child? (Destiny's Child charted 18 songs, but Queen Bey has charted 48.)
What happened to the solo careers of the Black Eyed Peas? (will.i.am charted 18 hits; 8 of which he was the main or solo artist, 10 of which he was just featured. Fergie charted 12; 9 of which she was the main or solo artist, 3 of which she was featured.)
Were there REALLY that many songs by the Black Eyed Peas? (Damn, the answer is yes. Yes, there were 16 charted Black Eyed Peas songs, including one that featured Macy Gray.)
What ever happened to Macy Gray? (Hmmm, seems like there are some questions even Splunk can't answer.)

And, while we can do a lot with this data, we can do some more if we take these songs and link them to Spotify data, so let's do that by first isolating song IDs in a file:

index="billboard" Chart_Date>2000-01-01 
| stats values(spotifyID) as IDs 
| dedup IDs 
| mvexpand IDs 
| fields - _mkv_child 
| outputlookup song_ids.csv

That search first takes all the data indexed in the "billboard" index since January 1, 2000, then pulls out the spotifyIDs. After deduplicating the IDs, we use the mvexpand to turn it from a multivalued field with all the IDs in one event to a single valued field. The way that Splunk does this is that it creates an internal field called "_mkv_child" and gives it an index number from 1 to the total number of events. Because we don't want to see this file in our output, we remove that field, and then we take all our IDs and use the "outputlookup" command to get them. Using the "| dedup" function drops our over 90,000 events into about 6100 unique Spotify song IDs, and we can now take that data and create a script to pull all the data that Spotify has on these songs:

import codecs, requests, json, os, time, spotipy, spotipy.util, sys


inputfile = 'June26_songidonly.csv'
outfilename = 'June26_spotify_info.json'

token = 'my secret token that spotify generated for me'
headers = {'Accept': 'application/json',

           'Authorization': 'Bearer ' + token}


with codecs.open(inputfile, 'r', 'utf-8') as f:
    for song_id in f:
        currentsong = requests.get('https://api.spotify.com/v1/tracks/?ids=' + song_id.strip(), headers=headers)
        print(currentsong)
        with open(outfilename, 'a') as outputfile:
            json.dump(currentsong.json(), outputfile)
print 'done'

(First, hat tip to Susan who debugged this and saved me a lot of whatever dwindling sanity I have remaining. Second, while Spotify tokens are only active for an hour, it's still better not to throw a token in cleartext onto the internet, the security professional in me says. Finally, if you need a shell script version of this because: reasons, you can get pretty much all of the files I used to make this work here. The reason I have a shell script has to do with nearly quitting on Python while trying to debug the code above.)

These data are all available in JSON, and then we let Splunk eat JSON, as it is wont to do, giving these the sourcetype "billboard_spotify_json", and then we can use the join command - using the spotifyID field, to attach all the spotify API info to the data we previously had, using this base search here:

index="billboard" sourcetype="billboard_top_100" Chart_Date > 2000-01-01
| join spotifyID 
    [ search index="billboard" sourcetype="billboard_spotify_json" 
    | spath "tracks{}.id" 
    | rename "tracks{}.id" as spotifyID]

And now we can do things like find out how song popularity trends over time - spoiler alert - people listen to pop songs less as the songs get older, but that there are also really interesting outliers when we look at the top 10 songs for every Hot-100 chart since 2000:

Screen Shot 2017-06-26 at 3.10.42 PM.png

There are some interesting spikes, one of which is the June 17, 2000 chart:

Screen Shot 2017-06-26 at 3.16.07 PM.png

Notwithstanding that two songs don't have SpotifyIDs, it's amazing that the other 8 songs seem to have staying power. (And, until this moment, I'd completely forgotten about Joe's "I Wanna Know.") Also, Spotify needs to get on acquiring Aaliyah's entire discography. The issue with Santana's song is that the scraper wasn't able to match it due to Spotify only having the "Radio Remix," but that song also has a popularity of over 50.

And now, because "I Wanna Know" how popular "I Wanna Know" is, I found out that it's more popular than 100(!!!) different previous number 1 songs:

Screen Shot 2017-06-26 at 3.28.25 PM.png

Crazy, right? Such is the life of a number one song - one day you're hot stuff, and then, another day, you can't even compete with a release from 2000 that's by an artist who hasn't consistently charted in the early 2000s and last charted for a couple of weeks in 2007!

Oh, also, just to let you know... the artist/group/band that has produced more Hot 100 hits than anyone else since 2000? The Glee Cast:

Screen Shot 2017-06-26 at 3.37.14 PM.png

Yup, that just happened. There are 11 artists overall who have produced more than 30 Hot 100 hits, and somehow, the Glee Cast blew all the other 10 out of the water. 3,213 artists enter, and the Glee Cast is triumphant! There are a lot of sentences I've never thought I'd type in my life... but that one is so far out there that I never even thought to think I'd never type it!

There's a lot more that we can find out, which is why I've provided scripts and props here if you want to explore more trends! Happy Splunking!

Credit and thanks: It has to be said again, but I could not have done this project without Allen Guo's Billboard chart scraper here, and I would still be smacking my head against a wall trying to make my python code go without Susan's invaluable proofreading, programming, and debugging skills.

Tags: music, Splunk 6.x, Operational Intelligence, Python, Spotify, Billboard, Data Trends, Data Science

Blog