As music has changed over the centuries, so has each generation's taste in it. Today's technology exposes listeners to music of different genres, languages, and even cultures. To analyze how this exposure has shaped society's taste in music, the generation that grew up with the Internet, Generation Z, is the perfect candidate.
Generation Z refers to those born between 1997 and 2012 (ages 9 to 24 in 2021). Although Spotify (a music streaming app) has ample data on all of its songs, it is difficult to sift out specific data on Gen Z. What other resources can we turn to?
...
TikTok! This video-sharing app is dominated by Gen Z, as 70% of the users are between 9 and 24 years old. Although not all videos use songs in the background, for simplicity, we will only explore videos that contain songs.
Are you still interested in stats for Spotify? No worries, I will take this opportunity to dive into this app as well ᕦ(ò_óˇ)
Finally, I will take data from both apps to draw comparisons between these two, as well as see whether TikTok made an impact on trending songs in Spotify.
Some necessary imports for this tutorial include: pandas, numpy, seaborn, matplotlib
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
To get a more precise analysis of TikTok, I found a random sample of around 6,500 TikTok videos from 2021 that each contain a song, available here. Each entry contains attributes of the song in that particular video, such as the name of the track played, the artist of that track, duration, and so on. The data was readily available as a csv file, so I downloaded the file and loaded it into a dataframe. Because not every attribute will be used in our analysis, I dropped the irrelevant columns to clean up the data.
# load tiktok data into dataframe
tiktok_data = pd.read_csv('tiktok.csv')
# drop unnecessary columns
tiktok_data = tiktok_data.drop(['Unnamed: 0','track_id','artist_id','album_id',
'playlist_id','genre','playlist_name','duration_mins'], axis = 1)
tiktok_data.head()
track_name | artist_name | duration | release_date | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Lay It Down Gmix - Main | Lloyd | 302186 | 2011-01-01 | 28 | 0.597 | 0.800 | 1 | -5.423 | 0 | 0.3120 | 0.0461 | 0.0 | 0.1800 | 0.565 | 155.932 |
1 | Bartender (feat. Akon) | T-Pain | 238800 | 2007-06-05 | 75 | 0.832 | 0.391 | 8 | -8.504 | 1 | 0.0628 | 0.0564 | 0.0 | 0.2240 | 0.436 | 104.961 |
2 | Bartender (feat. Akon) | T-Pain | 238800 | 2007-06-05 | 75 | 0.832 | 0.391 | 8 | -8.504 | 1 | 0.0628 | 0.0564 | 0.0 | 0.2240 | 0.436 | 104.961 |
3 | Chosen (feat. Ty Dolla $ign) | Blxst | 161684 | 2020-12-04 | 76 | 0.571 | 0.767 | 2 | -5.160 | 1 | 0.2870 | 0.3360 | 0.0 | 0.0809 | 0.605 | 93.421 |
4 | Tie Me Down (with Elley Duhé) | Gryffin | 218295 | 2018-08-03 | 72 | 0.548 | 0.839 | 6 | -2.371 | 1 | 0.0644 | 0.1350 | 0.0 | 0.1020 | 0.314 | 98.932 |
As for the Spotify data, I repeated the process: I downloaded the data as a csv file from here and loaded it into a dataframe. This dataset, however, is the top 50 songs in the USA in 2021, so it contains only 50 entries, with attributes such as the song's title, artist, duration, and so on. Similarly, I dropped the columns irrelevant to our analysis.
# load spotify data into dataframe
spotify_data = pd.read_csv('spotify_top50_2021.csv')
# drop unnecessary columns
spotify_data = spotify_data.drop(['id','track_id'], axis = 1)
spotify_data.head()
artist_name | track_name | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Olivia Rodrigo | drivers license | 92 | 0.561 | 0.431 | 10 | -8.810 | 1 | 0.0578 | 0.76800 | 0.000014 | 0.1060 | 0.137 | 143.875 | 242013 | 4 |
1 | Lil Nas X | MONTERO (Call Me By Your Name) | 90 | 0.593 | 0.503 | 8 | -6.725 | 0 | 0.2200 | 0.29300 | 0.000000 | 0.4050 | 0.710 | 178.781 | 137704 | 4 |
2 | The Kid LAROI | STAY (with Justin Bieber) | 92 | 0.591 | 0.764 | 1 | -5.484 | 1 | 0.0483 | 0.03830 | 0.000000 | 0.1030 | 0.478 | 169.928 | 141806 | 4 |
3 | Olivia Rodrigo | good 4 u | 95 | 0.563 | 0.664 | 9 | -5.044 | 1 | 0.1540 | 0.33500 | 0.000000 | 0.0849 | 0.688 | 166.928 | 178147 | 4 |
4 | Dua Lipa | Levitating (feat. DaBaby) | 89 | 0.702 | 0.825 | 6 | -3.787 | 0 | 0.0601 | 0.00883 | 0.000000 | 0.0674 | 0.915 | 102.977 | 203064 | 4 |
The current TikTok dataframe merely contains entries in no particular order. To get an idea of this dataset visually, I condensed the dataframe by adding a column that contains the number of times a song was used in this dataset. This way, each song takes up only one entry in the dataframe, which reduced the number of entries by about half.
Using the newly made "count" column, I then sorted all the entries in descending count order, meaning the songs that are most used are at the top. Because we are analyzing the top 50 songs in Spotify, in order to get a closer comparison, I kept the top 50 songs in the TikTok dataframe and dropped the rest.
# update dataframe with new column 'count'
tiktok_data['count'] = tiktok_data['track_name'].map(tiktok_data['track_name'].value_counts())
# sort songs by most used to least used and remove duplicate entries
tiktok_data = tiktok_data.sort_values(by=['count'], ascending=False)
tiktok_data = tiktok_data.drop_duplicates(subset = "track_name")
# drop the rows that are not 50 most used
tiktok_data = tiktok_data.drop(tiktok_data.index[50:])
tiktok_data = tiktok_data.reset_index(drop=True)
tiktok_data.head()
track_name | artist_name | duration | release_date | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Don't Start Now | Dua Lipa | 183290 | 2019-10-31 | 85 | 0.794 | 0.793 | 11 | -4.521 | 0 | 0.0842 | 0.0125 | 0.000000 | 0.0952 | 0.677 | 123.941 | 26 |
1 | What You Know Bout Love | Pop Smoke | 160000 | 2020-07-03 | 87 | 0.709 | 0.548 | 10 | -8.493 | 1 | 0.3530 | 0.6500 | 0.000002 | 0.1330 | 0.543 | 83.995 | 24 |
2 | OUT WEST (feat. Young Thug) | JACKBOYS | 157712 | 2019-12-27 | 83 | 0.802 | 0.591 | 8 | -4.895 | 1 | 0.2250 | 0.0104 | 0.000000 | 0.1960 | 0.309 | 139.864 | 23 |
3 | drivers license | Olivia Rodrigo | 242013 | 2021-01-08 | 94 | 0.585 | 0.436 | 10 | -8.761 | 1 | 0.0601 | 0.7210 | 0.000013 | 0.1050 | 0.132 | 143.874 | 23 |
4 | No Idea | Don Toliver | 154424 | 2019-05-29 | 5 | 0.651 | 0.631 | 6 | -5.717 | 0 | 0.0896 | 0.5190 | 0.000579 | 0.1650 | 0.350 | 127.994 | 23 |
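Counting by `track_name` alone treats two different songs that share a title as one song. As a minimal sketch of an alternative, assuming a toy dataframe with the same `track_name` and `artist_name` columns (the rows below are made up for illustration, not taken from the dataset), the count can be grouped on both columns:

```python
import pandas as pd

# Toy stand-in for the raw TikTok dataframe (hypothetical rows, not real data)
toy = pd.DataFrame({
    "track_name": ["Song A", "Song A", "Song B", "Song A", "Song B"],
    "artist_name": ["Artist 1", "Artist 1", "Artist 2", "Artist 3", "Artist 2"],
})

# Count uses per (track, artist) pair, then sort from most to least used
usage = (
    toy.groupby(["track_name", "artist_name"])
       .size()
       .reset_index(name="count")
       .sort_values("count", ascending=False)
       .reset_index(drop=True)
)

# Keep only the n most used songs (n = 50 in the tutorial, 2 here)
top_n = usage.head(2)
print(top_n)
```

Here "Song A" by "Artist 3" stays separate from "Song A" by "Artist 1" instead of inflating its count.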
Before diving in, because not everyone is familiar with the musical terminology, I would like to explain each attribute/column of the songs. The dropped columns, however, will not be featured in this section.
Duration - how long the song is, in milliseconds
Popularity - The higher the value, the more popular the song is; this measure is precalculated and provided by the dataset
Danceability - The higher the value, the easier it is to dance to this song; value is precalculated
Energy - The higher the value, the more energetic the song is; value is precalculated
Key - The key the song is in. Each integer is associated with a specific pitch in the standard Pitch Class notation. For example, 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1
Loudness - The higher the value, the louder the song; value is precalculated
Mode - Specifies the modality (major or minor) of a track, which is the type of scale from which its melodic content is derived. Major is represented by the value 1 and minor is 0
Speechiness - The higher the value the more spoken word the song contains; value is precalculated
Acousticness - The higher the value the more acoustic the song is; value is precalculated
Instrumentalness - An estimate of whether a song contains no vocals. The closer the value is to 1.0, the more likely the song is instrumental
Liveness - The higher the value, the more likely the song is a live recording
Valence - The higher the value, the more positive mood for the song
Tempo - The overall tempo of a song in beats per minute (BPM), usually indicates how fast a song is
Time signature - An indication of rhythm, generally represented as a fraction with the denominator defining the beat as a division of a whole note and the numerator giving the number of beats in each bar.
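To make the Key and Mode encodings above concrete, here is a small sketch that turns the two integers into a readable label. The helper name `describe_key` is hypothetical and not part of either dataset; the pitch-class order follows the standard notation described above.

```python
# Pitch-class names for key values 0-11; -1 means no key was detected
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def describe_key(key: int, mode: int) -> str:
    """Return a readable label for a key/mode pair (hypothetical helper)."""
    if key == -1:
        return "unknown key"
    scale = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {scale}"


# "drivers license" in the Spotify table has key=10 and mode=1
print(describe_key(10, 1))
```

With the dataframes loaded, the same helper could be applied row by row, for example `spotify_data.apply(lambda r: describe_key(r["key"], r["mode"]), axis=1)`.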
The bar graph below provides a visualization of the newly reformatted TikTok data. I also added a line that marks the average count to get an insight on the distribution of just the top 50 songs on TikTok.
# bar graph of each song in top 50 and its count
fig, ax = plt.subplots(figsize=(20, 25))  # set the size when the figure is created
ax.set_title("Top 50 Songs on TikTok", fontsize=16)
X = tiktok_data["count"]
Y = tiktok_data["track_name"]
sns.barplot(y = Y, x = X, ax = ax)  # draw on the axes created above
# a line to indicate the mean count
avg = tiktok_data["count"].mean()
ax.axvline(avg, color="black", linewidth=2);
The most used song from our dataset, "Don't Start Now," has a usage count of 26, while the least used has 12, which is almost half of the leading song. From the graph, we can also conclude that only the top 17 songs (34% of the songs) are above the average count.
Since the Spotify dataset came with all the songs in ranking order, we do not need to take extra steps to reformat the data. We can conclude directly that the most played song in 2021 is "drivers license" by Olivia Rodrigo.
We have all these attributes readily available, but are there any correlations between them? Perhaps this could reveal what makes a song popular on each platform. To help answer these questions, I created a heatmap for both TikTok's and Spotify's top 50 songs. Red indicates a positive correlation and blue a negative one; the paler the cell, the closer the correlation is to zero.
# creating tiktok heatmap
sns.set_theme(rc = {'figure.figsize':(13,13)})
c_map = sns.diverging_palette(220, 20, as_cmap=True)
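# numeric_only=True (pandas 1.5+) skips text columns such as track_name and release_date
sns.heatmap(tiktok_data.corr(numeric_only=True), cmap=c_map, annot=True)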
plt.show()
At first glance, the attributes of TikTok's songs mostly have low correlations with each other. The strongest positive correlation is between loudness and energy (0.65), whereas the strongest negative one is between acousticness and energy (-0.48). In other words, louder songs tend to be more energetic, while more acoustic songs tend to be less so. Another notable relationship is danceability and valence, with a correlation of 0.41.
Looking at popularity, energy has the strongest correlation with it and tempo has the weakest. This suggests that, regardless of tempo, the more energetic (and therefore likely louder) a song is, the more popular it could be on TikTok. This makes sense because dance videos make up a huge portion of this app, as this was the app's initial intent.
# creating spotify heatmap
sns.set_theme(rc = {'figure.figsize':(13,13)})
c_map = sns.diverging_palette(220, 20, as_cmap=True)
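# numeric_only=True (pandas 1.5+) skips text columns such as artist_name and track_name
sns.heatmap(spotify_data.corr(numeric_only=True), cmap=c_map, annot=True)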
plt.show()
The strongest positive correlation here is between loudness and energy (0.75), while the strongest negative one is between acousticness and energy (-0.68), the same pairs we observed in the TikTok heatmap. Valence also has a substantial correlation with both energy and loudness. This could potentially mean that the happier the song, the more energetic and loud it is.
Focusing on the popularity column, we can see mode has the strongest positive correlation and danceability has the most negative. The mode in this dataset represents the major or minor key, where minor is zero and major is one. These two variables may affect the popularity of a song on Spotify the most, and it is possible that a song with lower danceability in a major key will be more popular than a song with high danceability in a minor key. With only 50 songs, though, this is a tentative reading.
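Reading one column off a large heatmap is easy to get wrong, so the popularity correlations can also be ranked directly. The sketch below uses synthetic data (random values, not the real datasets) in which popularity is deliberately built to depend on energy; with the real dataframes, `toy` would be replaced by `tiktok_data` or `spotify_data`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a songs dataframe (random values, not the real data)
rng = np.random.default_rng(0)
energy = rng.uniform(0, 1, 200)
toy = pd.DataFrame({
    "energy": energy,
    "tempo": rng.uniform(70, 180, 200),
    # popularity is built to depend on energy so the ranking is predictable
    "popularity": 50 + 40 * energy + rng.normal(0, 5, 200),
})

# Correlation of every numeric column with popularity, largest magnitude first
pop_corr = (
    toy.corr(numeric_only=True)["popularity"]
       .drop("popularity")
       .sort_values(key=abs, ascending=False)
)
print(pop_corr)
```

Sorting by absolute value keeps strong negative correlations near the top, which avoids confusing "most negative" with "weakest".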
So many attributes, so little time... So I decided to focus on the ones that seem most relevant according to the previous heatmaps: energy, loudness, danceability, and acousticness. Since the dataset does not explain how these values were calculated, the x-axis will not contain units. However, all four attributes are consistent in that the higher the value, the more [insert attribute] the song is.
# plot histogram for TikTok and Spotify: energy
fig, ax = plt.subplots();
# bars for energy ᕦ(ò_óˇ)
a_heights, a_bins = np.histogram(tiktok_data['energy']);
b_heights, b_bins = np.histogram(spotify_data['energy'], bins=a_bins);
width = (a_bins[1] - a_bins[0])/3
# prettify the graph
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue', label='TikTok');
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='plum', label='Spotify');
fig.suptitle('Energy', fontsize=20);
plt.ylabel('Number of Songs', fontsize=16);
plt.xlabel('Level of Energy', fontsize=16);
ax.legend();
It appears that most trending songs on TikTok do not have high energy, while songs from Spotify have average to high energy. This is mildly surprising because, as mentioned earlier, dance videos make up a huge portion of videos on TikTok. This leads me to believe that dance videos do not necessarily need high-energy songs.
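One caveat about the histogram code: it reuses TikTok's bin edges for Spotify, and `np.histogram` ignores values that fall outside the edges it is given, so a Spotify song outside TikTok's range may be left out of the plot. A minimal sketch with made-up values, using `np.histogram_bin_edges` on the combined data as one possible alternative:

```python
import numpy as np

# Made-up energy values; the second sample extends beyond the first one's range
a = np.array([0.40, 0.45, 0.50, 0.55, 0.60])
b = np.array([0.30, 0.50, 0.60, 0.90])

# Reusing a's bin edges: values of b outside a's range are not counted
_, a_bins = np.histogram(a)
b_heights_reused, _ = np.histogram(b, bins=a_bins)

# Shared edges computed from both samples keep every value
shared_bins = np.histogram_bin_edges(np.concatenate([a, b]), bins=10)
a_heights, _ = np.histogram(a, bins=shared_bins)
b_heights, _ = np.histogram(b, bins=shared_bins)

print(b_heights_reused.sum(), "of", len(b), "values counted with reused edges")
print(b_heights.sum(), "of", len(b), "values counted with shared edges")
```

The same `shared_bins` array can then be passed to both `ax.bar` calls in place of `a_bins` and `b_bins`.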
# plot histogram for TikTok and Spotify: loudness
fig, ax = plt.subplots();
# bars for loud
a_heights, a_bins = np.histogram(tiktok_data['loudness']);
b_heights, b_bins = np.histogram(spotify_data['loudness'], bins=a_bins);
width = (a_bins[1] - a_bins[0])/3
# prettify the graph
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue', label='TikTok');
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='plum', label='Spotify');
fig.suptitle('Loudness', fontsize=20);
plt.ylabel('Number of Songs', fontsize=16);
plt.xlabel('Level of Loudness', fontsize=16);
ax.legend();
Most songs from TikTok are loud, as the distribution is left-skewed. For Spotify, however, the distribution looks roughly bell-shaped, meaning most of the songs have moderate volume, with few songs that are too loud or too soft.
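Skew can also be checked numerically rather than by eye. As a sketch, assuming made-up loudness values (not taken from either dataset), pandas' `Series.skew()` returns a negative number for a left-skewed sample and a value near zero for a symmetric one:

```python
import pandas as pd

# Made-up loudness values with a long tail toward quiet songs
loudness = pd.Series([-4.0, -4.5, -5.0, -5.0, -5.5, -6.0, -9.0, -14.0])

# Series.skew() returns the sample skewness: negative means left-skewed
skewness = loudness.skew()
print(f"skewness = {skewness:.2f}")
```

On the real data this would be `tiktok_data['loudness'].skew()` and `spotify_data['loudness'].skew()`.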
Hmmmmmm
From the heatmap, loudness and energy are strongly correlated. But how strong? Perhaps a scatter plot can help with this, and we can even find the equation for line of best fit of this relationship to help us predict the loudness or energy of a song in the future.
# plot points and line for Tiktok
x = tiktok_data['loudness']
y = tiktok_data['energy']
plt.scatter(x, y, c='b', label='TikTok')
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x+b, c='b')
eq1 = "TikTok Line of Best Fit: y[energy] = "+str(m)+" * x[loudness] + "+str(b)
# plot points and line for Spotify
x = spotify_data['loudness']
y = spotify_data['energy']
plt.scatter(x, y, c='purple', label='Spotify')
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x+b, c='purple')
eq2 = "Spotify Line of Best Fit: y[energy] = "+str(m)+" * x[loudness] + "+str(b)
# set titles and labels (x is loudness, y is energy)
plt.title('Energy vs. Loudness', fontsize=16);
plt.ylabel('Energy', fontsize=16);
plt.xlabel('Loudness', fontsize=16);
plt.legend(loc='upper left')
plt.show()
print(eq1)
print(eq2)
TikTok Line of Best Fit: y[energy] = 0.03922673705836067 * x[loudness] + 0.8775505849346418
Spotify Line of Best Fit: y[energy] = 0.054131218234488826 * x[loudness] + 0.9659463819589633
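The fitted line gives a slope and intercept, but it does not by itself say how strong the relationship is. A minimal sketch, using made-up (loudness, energy) pairs rather than the real data, that adds the Pearson correlation coefficient via `np.corrcoef` and uses the line for a prediction:

```python
import numpy as np

# Made-up (loudness, energy) pairs for illustration only
loudness = np.array([-9.0, -8.0, -6.5, -5.5, -5.0, -4.0, -3.0])
energy = np.array([0.40, 0.52, 0.55, 0.70, 0.66, 0.80, 0.85])

# Least-squares line: energy ≈ m * loudness + b
m, b = np.polyfit(loudness, energy, 1)

# Pearson correlation coefficient and coefficient of determination
r = np.corrcoef(loudness, energy)[0, 1]
r_squared = r ** 2

print(f"energy = {m:.4f} * loudness + {b:.4f}")
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")

# Predict the energy of a hypothetical song with loudness -6
predicted = m * -6.0 + b
print(f"predicted energy at loudness -6: {predicted:.3f}")
```

For a straight-line fit with one predictor, `r_squared` is the fraction of the variance in energy that the line accounts for.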
# plot histogram for TikTok and Spotify: danceability ヽ(⌐■_■)ノ♪♬
fig, ax = plt.subplots();
# bars for dance
a_heights, a_bins = np.histogram(tiktok_data['danceability']);
b_heights, b_bins = np.histogram(spotify_data['danceability'], bins=a_bins);
width = (a_bins[1] - a_bins[0])/3
# prettify the graph
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue', label='TikTok');
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='plum', label='Spotify');
fig.suptitle('Danceability', fontsize=20);
plt.ylabel('Number of Songs', fontsize=16);
plt.xlabel('Level of Danceability', fontsize=16);
ax.legend();
The distribution for TikTok is once again left-skewed, meaning most songs are danceable. This is no surprise, considering that dance videos make up a huge portion of the app. The Spotify songs seem to be on the higher side of danceability as well, suggesting that more danceable songs are more likely to trend.
# plot histogram for TikTok and Spotify: acousticness
fig, ax = plt.subplots();
# bars for acousticness
a_heights, a_bins = np.histogram(tiktok_data['acousticness']);
b_heights, b_bins = np.histogram(spotify_data['acousticness'], bins=a_bins);
width = (a_bins[1] - a_bins[0])/3
# prettify the graph
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue', label='TikTok');
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='plum', label='Spotify');
fig.suptitle('Acousticness', fontsize=20);
plt.ylabel('Number of Songs', fontsize=16);
plt.xlabel('Level of Acousticness', fontsize=16);
ax.legend();
Lastly, I examined acousticness, the attribute that correlated negatively with energy in both heatmaps. TikTok and Spotify seem to have similar distributions, with most songs having low acousticness. It appears that not only Gen Z but listeners in general these days prefer more upbeat, danceable songs!
Up to this point, we have been neglecting two very important components of music - tempo and time signature! Do we enjoy faster or slower songs nowadays? How about songs that sound "even" or "odd"? These attributes will help answer these questions.
# creating density plots
tiktok_data["tempo"].plot.kde(bw_method=0.15, c = 'blue', label = 'TikTok');
spotify_data["tempo"].plot.kde(bw_method=0.15, c = 'purple', label = 'Spotify');
# labels and title
plt.title('Tempo', fontsize=16);
plt.ylabel('Density', fontsize=16);
plt.xlabel('Tempo (Beats per Minute)', fontsize=16);
plt.legend(loc='upper left');
In this density plot, most songs lie within the range of 80 to 160 bpm. The two curves have similar shapes, with most maxima and minima in the same places: both peak at around 130 bpm and have a secondary peak at around 165 to 170 bpm. However, TikTok's density exceeds Spotify's for the most part, which indicates that Spotify's tempos are more spread out and cover a wider range than TikTok's.
# print stats
print("TikTok's Summary Statistics on Tempo:\n",tiktok_data["tempo"].describe())
print("Spotify's Summary Statistics on Tempo:\n",spotify_data["tempo"].describe())
TikTok's Summary Statistics on Tempo:
count 50.000000
mean 119.439340
std 22.949968
min 71.994000
25% 100.787750
50% 119.934500
75% 132.315000
max 171.020000
Name: tempo, dtype: float64
Spotify's Summary Statistics on Tempo:
count 50.000000
mean 121.083860
std 29.252206
min 72.017000
25% 98.655500
50% 120.516500
75% 138.532000
max 180.917000
Name: tempo, dtype: float64
I have printed the statistics for both apps to summarize and decipher our findings.
Mean: For both apps, the average tempo is around 120 bpm, so we like songs that aren't too fast or too slow.
Stdev: As expected, Spotify's standard deviation is greater than TikTok's (29.25 vs. 22.95), indicating that its tempo data is more spread out.
Q1(25%) & Q3(75%): The gap between these two quartiles is wider for Spotify (about 39.9 bpm) than for TikTok (about 31.5 bpm), another sign of a bigger spread.
Median(50%): For both apps, the median is almost identical to the mean, which means the tempo data for both is almost symmetrical.
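The quartile comparison can be made explicit with the interquartile range (IQR), the width of the middle 50% of songs. A small sketch using the quartiles copied from the printed statistics; the variable names are only illustrative:

```python
# Quartiles copied from the describe() output printed above
tiktok_q1, tiktok_q3 = 100.787750, 132.315000
spotify_q1, spotify_q3 = 98.655500, 138.532000

# Interquartile range (IQR) = Q3 - Q1, the spread of the middle 50% of songs
tiktok_iqr = tiktok_q3 - tiktok_q1
spotify_iqr = spotify_q3 - spotify_q1

print(f"TikTok IQR:  {tiktok_iqr:.2f} bpm")
print(f"Spotify IQR: {spotify_iqr:.2f} bpm")
```

With the dataframes loaded, the same quartiles are available from `tiktok_data["tempo"].quantile([0.25, 0.75])`.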
# print stats for time_signature (Spotify)
print("Spotify's Summary Statistics on Time Signature:\n",spotify_data["time_signature"].describe())
Spotify's Summary Statistics on Time Signature:
count 50.000000
mean 3.960000
std 0.197949
min 3.000000
25% 4.000000
50% 4.000000
75% 4.000000
max 4.000000
Name: time_signature, dtype: float64
Unfortunately, TikTok's dataset did not come with time signature. No worries though, we can still see what people prefer on Spotify. From the printed statistics, almost all the trending songs are in 4/4 time: even at the 25th percentile, the time signature is 4. The minimum of 3 shows that some songs are in 3/4, and a mean of 3.96 over 50 songs works out to exactly two of them (48 × 4 + 2 × 3 = 198 = 50 × 3.96). We like our songs to feel "even"!
Are there any particular artists that we like nowadays? Let's see who has more than just one trending song in the top 50. To help us visualize this better, I created separate dataframes to obtain the artists with the most songs on both apps.
# artists with most songs in TikTok
tiktok_data['Count']=1
tiktok_artist = tiktok_data.groupby('artist_name')['Count'].sum().reset_index().sort_values(by='Count',ascending=False)
tiktok_top_ten = tiktok_artist.head(10)
# artists with most songs in Spotify
spotify_data['Count']=1
spotify_artist = spotify_data.groupby('artist_name')['Count'].sum().reset_index().sort_values(by='Count',ascending=False)
spotify_top_ten = spotify_artist.head(10)
print("Left: TikTok | Right: Spotify")
pd.concat([d.reset_index(drop=True) for d in [tiktok_top_ten, spotify_top_ten]], axis=1)
Left: TikTok | Right: Spotify
artist_name | Count | artist_name | Count | |
---|---|---|---|---|
0 | Megan Thee Stallion | 2 | Doja Cat | 4 |
1 | Cardi B | 2 | Olivia Rodrigo | 4 |
2 | Doja Cat | 2 | Bad Bunny | 3 |
3 | Pop Smoke | 2 | Lil Nas X | 2 |
4 | 24kGoldn | 1 | BTS | 2 |
5 | Ritt Momney | 1 | The Weeknd | 2 |
6 | Lil Vinceyy | 1 | Dua Lipa | 2 |
7 | Mike Posner | 1 | The Kid LAROI | 2 |
8 | Monte Booker | 1 | Ariana Grande | 2 |
9 | Nelly Furtado | 1 | Måneskin | 2 |
For TikTok, the artists with the most songs on the list are Megan Thee Stallion, Cardi B, Doja Cat, and Pop Smoke. All four artists are rappers, which leads me to believe that rap songs are more likely to trend on TikTok. However, despite their popularity, they each have only 2 songs on the list, so there is more artist variety among the 50 songs.
As for Spotify, Doja Cat and Olivia Rodrigo won, with four songs each. The rest of the artists in the top ten have 2 to 3 songs each. This means that 25 songs belong to artists within the top ten! That's half of the top 50 songs! Compared to TikTok, Spotify's list has less artist variety, as the same artists tend to hit the charts. Props to Doja Cat for making both lists though!
For both TikTok and Spotify, most of the attributes we have explored are strikingly similar. Although we cannot confirm that TikTok songs have influenced the trending songs on Spotify, we can see whether the lists share any songs.
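The "half of the top 50" figure can be checked with a quick sum. This sketch copies the song counts from the table above into two small Series (the variable names mirror the tutorial's, but the Series here are typed in by hand):

```python
import pandas as pd

# Song counts for Spotify's top ten artists, copied from the table above
spotify_top_ten = pd.Series(
    [4, 4, 3, 2, 2, 2, 2, 2, 2, 2],
    index=["Doja Cat", "Olivia Rodrigo", "Bad Bunny", "Lil Nas X", "BTS",
           "The Weeknd", "Dua Lipa", "The Kid LAROI", "Ariana Grande", "Måneskin"],
)
# TikTok's top ten from the same table: four artists with 2 songs, six with 1
tiktok_top_ten = pd.Series([2, 2, 2, 2, 1, 1, 1, 1, 1, 1])

total_songs = 50
spotify_share = spotify_top_ten.sum() / total_songs
tiktok_share = tiktok_top_ten.sum() / total_songs

print(f"Spotify top-ten artists: {spotify_share:.0%} of the list")
print(f"TikTok top-ten artists:  {tiktok_share:.0%} of the list")
```

A larger share held by the top ten artists is one simple way to quantify "less artist variety".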
# creating a dataframe that has common songs
common_songs = pd.merge(tiktok_data, spotify_data, on='track_name', how='inner')
# drop unnecessary columns
common_songs = common_songs[['track_name','artist_name_x']]
display(common_songs)
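# compute the share from the merge result instead of hard-coding 8/50
print(str(len(common_songs)/len(spotify_data)*100), "% songs in common")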
track_name | artist_name_x | |
---|---|---|
0 | Don't Start Now | Dua Lipa |
1 | drivers license | Olivia Rodrigo |
2 | Mood (feat. iann dior) | 24kGoldn |
3 | Peaches (feat. Daniel Caesar & Giveon) | Justin Bieber |
4 | 34+35 | Ariana Grande |
5 | Heartbreak Anniversary | Giveon |
6 | Blinding Lights | DJ Challenge X |
7 | Kiss Me More (feat. SZA) | Doja Cat |
16.0 % songs in common
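Out of the top 50 songs for each app, there are 8 songs in common, not bad! Note that the merge matches on track name only, so "Blinding Lights" counts as shared even though it is credited to DJ Challenge X in the TikTok data. We cannot draw any conclusions, however, about Gen Z influencing the music industry, as popular songs on TikTok could have been influenced by current trending songs as well.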
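Whether a cover should count as a shared song is a judgment call. As a sketch of the stricter option, assuming two toy top lists (hypothetical rows, not the real data), the merge can match on both title and artist:

```python
import pandas as pd

# Toy top lists (hypothetical rows); "Song B" is credited to a cover artist on one side
tiktok_top = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song C"],
    "artist_name": ["Artist 1", "Cover Artist", "Artist 3"],
})
spotify_top = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song D"],
    "artist_name": ["Artist 1", "Artist 2", "Artist 4"],
})

# Matching on title only counts the cover as a shared song
by_title = pd.merge(tiktok_top, spotify_top, on="track_name", how="inner")

# Matching on title and artist keeps only exact matches
by_title_and_artist = pd.merge(
    tiktok_top, spotify_top, on=["track_name", "artist_name"], how="inner"
)

share = len(by_title_and_artist) / len(spotify_top) * 100
print(len(by_title), "shared by title;", len(by_title_and_artist), "shared by title and artist")
print(f"{share:.1f}% of the list in common")
```

The stricter match can undercount too, for example when the two datasets spell a featured artist differently, so the title-only result is still worth reporting alongside it.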
As we analyzed and compared the trending songs from TikTok and Spotify, it is safe to generalize a few things about trending songs in this generation:
Through this analysis, I have learned much more about the songs in the TikTok subculture, as well as trending songs on Spotify. I now have a deeper understanding of why we prefer certain songs over others, and what attributes should be emphasized if I want to make a hit song one day...
Thank you for exploring with me, and I hope you learned something from this adventure as well.