A Musical Analysis on Generation Z (aka TikTok)

Judy Song

Introduction

As music changes rapidly over the centuries, so does each generation's taste in music. Today's technology provides exposure to music of different genres, languages, and even cultures. To analyze how this newfound exposure has affected society's taste in music, the generation that grew up with the Internet, Generation Z, is the perfect candidate.

Generation Z refers to those born between 1997 and 2012 (ages 9 to 24). Although Spotify (a music streaming app) has ample data on all of its songs, it is difficult to sift out specific data on Gen Z. What other resources can we turn to?

...

TikTok! This video-sharing app is dominated by Gen Z, as 70% of its users are between 9 and 24 years old. Although not all videos use songs in the background, for simplicity, we will only explore videos that contain songs.

Are you still interested in stats for Spotify? No worries, I will take this opportunity to dive into this app as well ᕦ(ò_óˇ)

Finally, I will take data from both apps to draw comparisons between the two, and to see whether TikTok made an impact on trending songs on Spotify.

Data Collection & Management

Some necessary imports for this tutorial include: pandas, numpy, seaborn, matplotlib

In [623]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

TikTok Dataset

In order to get a more precise analysis of TikTok, I found a dataset here: a random sample of around 6,500 TikTok videos from 2021, each of which contains a song. Each entry contains attributes of that particular video, such as the name of the track played, the artist of that track, the duration, and so on. The data was readily available in a csv file, so I downloaded the file and loaded it into a dataframe. Because not every attribute will be used in our analysis, I dropped the irrelevant columns to clean up the data.

In [624]:
# load tiktok data into dataframe
tiktok_data = pd.read_csv('tiktok.csv')

# drop unnecessary columns
tiktok_data = tiktok_data.drop(['Unnamed: 0','track_id','artist_id','album_id',
                                'playlist_id','genre','playlist_name','duration_mins'], axis = 1)
tiktok_data.head()
Out[624]:
   track_name                     artist_name  duration  release_date  popularity  danceability  energy  key  loudness  mode  speechiness  acousticness  instrumentalness  liveness  valence  tempo
0  Lay It Down Gmix - Main        Lloyd        302186    2011-01-01    28          0.597         0.800   1    -5.423    0     0.3120       0.0461        0.0               0.1800    0.565    155.932
1  Bartender (feat. Akon)         T-Pain       238800    2007-06-05    75          0.832         0.391   8    -8.504    1     0.0628       0.0564        0.0               0.2240    0.436    104.961
2  Bartender (feat. Akon)         T-Pain       238800    2007-06-05    75          0.832         0.391   8    -8.504    1     0.0628       0.0564        0.0               0.2240    0.436    104.961
3  Chosen (feat. Ty Dolla $ign)   Blxst        161684    2020-12-04    76          0.571         0.767   2    -5.160    1     0.2870       0.3360        0.0               0.0809    0.605    93.421
4  Tie Me Down (with Elley Duhé)  Gryffin      218295    2018-08-03    72          0.548         0.839   6    -2.371    1     0.0644       0.1350        0.0               0.1020    0.314    98.932

Spotify Dataset

As for the Spotify data, I repeated the process, downloading the data as a csv file from here and loading it into a dataframe. This dataset, however, is the top 50 songs in the USA in 2021, so it contains only 50 entries, with attributes such as the song's title, artist, and duration. Similarly, I dropped the columns irrelevant to our analysis.

In [625]:
# load spotify data into dataframe
spotify_data = pd.read_csv('spotify_top50_2021.csv')

# drop unnecessary columns
spotify_data = spotify_data.drop(['id','track_id'], axis = 1)

spotify_data.head()
Out[625]:
   artist_name     track_name                      popularity  danceability  energy  key  loudness  mode  speechiness  acousticness  instrumentalness  liveness  valence  tempo    duration_ms  time_signature
0  Olivia Rodrigo  drivers license                 92          0.561         0.431   10   -8.810    1     0.0578       0.76800       0.000014          0.1060    0.137    143.875  242013       4
1  Lil Nas X       MONTERO (Call Me By Your Name)  90          0.593         0.503   8    -6.725    0     0.2200       0.29300       0.000000          0.4050    0.710    178.781  137704       4
2  The Kid LAROI   STAY (with Justin Bieber)       92          0.591         0.764   1    -5.484    1     0.0483       0.03830       0.000000          0.1030    0.478    169.928  141806       4
3  Olivia Rodrigo  good 4 u                        95          0.563         0.664   9    -5.044    1     0.1540       0.33500       0.000000          0.0849    0.688    166.928  178147       4
4  Dua Lipa        Levitating (feat. DaBaby)       89          0.702         0.825   6    -3.787    0     0.0601       0.00883       0.000000          0.0674    0.915    102.977  203064       4

Data Reorganization

The current TikTok dataframe contains entries in no particular order, one per video. To get an idea of this dataset visually, I condensed the dataframe by adding a column that holds the number of times each song was used in the dataset. This way, a song takes up only one entry in the dataframe, which reduced the number of entries by about half.

Using the newly made "count" column, I then sorted all the entries in descending count order, meaning the songs that are most used are at the top. Because we are analyzing the top 50 songs in Spotify, in order to get a closer comparison, I kept the top 50 songs in the TikTok dataframe and dropped the rest.

In [626]:
# update dataframe with new column 'count'
tiktok_data['count'] = tiktok_data['track_name'].map(tiktok_data['track_name'].value_counts())

# sort songs by most used to least used and remove duplicate entries
tiktok_data = tiktok_data.sort_values(by=['count'], ascending=False)
tiktok_data = tiktok_data.drop_duplicates(subset = "track_name")

# drop the rows that are not 50 most used
tiktok_data = tiktok_data.drop(tiktok_data.index[50:])
tiktok_data = tiktok_data.reset_index(drop=True)
tiktok_data.head()
Out[626]:
   track_name                   artist_name     duration  release_date  popularity  danceability  energy  key  loudness  mode  speechiness  acousticness  instrumentalness  liveness  valence  tempo    count
0  Don't Start Now              Dua Lipa        183290    2019-10-31    85          0.794         0.793   11   -4.521    0     0.0842       0.0125        0.000000          0.0952    0.677    123.941  26
1  What You Know Bout Love      Pop Smoke       160000    2020-07-03    87          0.709         0.548   10   -8.493    1     0.3530       0.6500        0.000002          0.1330    0.543    83.995   24
2  OUT WEST (feat. Young Thug)  JACKBOYS        157712    2019-12-27    83          0.802         0.591   8    -4.895    1     0.2250       0.0104        0.000000          0.1960    0.309    139.864  23
3  drivers license              Olivia Rodrigo  242013    2021-01-08    94          0.585         0.436   10   -8.761    1     0.0601       0.7210        0.000013          0.1050    0.132    143.874  23
4  No Idea                      Don Toliver     154424    2019-05-29    5           0.651         0.631   6    -5.717    0     0.0896       0.5190        0.000579          0.1650    0.350    127.994  23
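
To make the counting step easier to follow, here is a minimal sketch of the same count, sort, and de-duplicate idea on a tiny made-up table. The `videos` dataframe and its values are hypothetical and only stand in for the real TikTok data.

```python
import pandas as pd

# Hypothetical toy data: one row per video, named by the song it uses.
videos = pd.DataFrame({
    "track_name": ["A", "B", "A", "C", "A", "B"],
    "artist_name": ["x", "y", "x", "z", "x", "y"],
})

# Attach to every row the number of times its track appears.
videos["count"] = videos["track_name"].map(videos["track_name"].value_counts())

# Sort by usage, keep one row per track, then keep the top two.
top = (
    videos.sort_values(by="count", ascending=False)
          .drop_duplicates(subset="track_name")
          .head(2)
          .reset_index(drop=True)
)
print(top)
```

Note that counting by `track_name` alone treats two different songs with the same title as one song, which is worth keeping in mind when reading the counts.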

Dataset Info

Before diving in, because not everyone is familiar with the musical terminology, I would like to explain each attribute/column of the songs. The dropped columns, however, are not covered in this section.

Duration - how long the song is, in milliseconds

Popularity - The higher the value, the more popular the song is; this measure is precalculated and provided by the dataset

Danceability - The higher the value, the easier it is to dance to this song; value is precalculated

Energy - The higher the value, the more energetic the song is; value is precalculated

Key - The key the song is in. Each integer is associated with a specific pitch in the standard Pitch Class notation. For example, 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1

Loudness - The higher the value, the louder the song; value is precalculated

Mode - Specifies the modality (major or minor) of a track, which is the type of scale from which its melodic content is derived. Major is represented by the value 1 and minor is 0

Speechiness - The higher the value the more spoken word the song contains; value is precalculated

Acousticness - The higher the value the more acoustic the song is; value is precalculated

Instrumentalness - A measure of how little vocal content a song has. The closer the value is to 1.0, the more instrumental the song is

Liveness - The higher the value, the more likely the song is a live recording

Valence - The higher the value, the more positive the mood of the song

Tempo - The overall tempo of a song in beats per minute (BPM), usually indicates how fast a song is

Time signature - An indication of rhythm, generally represented as a fraction with the denominator defining the beat as a division of a whole note and the numerator giving the number of beats in each bar.
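
Since the key and mode columns are stored as integers, a small lookup can make them readable. This is only a sketch based on the definitions above; `PITCH_CLASSES` and `describe_key` are hypothetical names that do not appear anywhere else in this tutorial.

```python
# Pitch-class names for the integers 0-11 used in the "key" column.
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]

def describe_key(key: int, mode: int) -> str:
    """Return a readable key such as 'D minor' (hypothetical helper)."""
    if key == -1:
        return "no key detected"
    scale = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {scale}"

# "drivers license" is listed with key 10 and mode 1 in the tables above.
print(describe_key(10, 1))
```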

Data Exploration and Visualization

TikTok Data: Revisualization

The bar graph below provides a visualization of the newly reformatted TikTok data. I also added a line that marks the average count to get an insight on the distribution of just the top 50 songs on TikTok.

In [627]:
# bar graph of each song in top 50 and its count
# (figsize is passed here because changing rcParams does not resize an existing figure)
fig, ax = plt.subplots(figsize=(20, 25))
ax.set_title("Top 50 Songs on TikTok", fontsize=16)
X = tiktok_data["count"]
Y = tiktok_data["track_name"]
# draw on the axes created above instead of overwriting fig
sns.barplot(y = Y, x = X, ax = ax)

# a line to indicate the mean count
avg = tiktok_data["count"].mean()
ax.axvline(avg, color="black", linewidth=2);

The most used song from our dataset, "Don't Start Now," has a usage count of 26, while the least used has 12, which is almost half of the leading song. From the graph, we can also conclude that only the top 17 songs (34% of the songs) are above the average count.

Since the Spotify dataset came with all the songs in ranking order, we do not need to take extra steps to reformat the data. However, we can still conclude that the most played song in 2021 is "drivers license" by Olivia Rodrigo.

To Correlate or Not to Correlate?

We have all these attributes readily available, but are there any correlations between them? Perhaps this could reveal what makes a song popular on each platform. To help answer these questions, I created a heatmap for both TikTok's and Spotify's top 50 songs. Red indicates a positive correlation and blue indicates a negative one; the paler the cell, the closer the correlation is to zero.

In [628]:
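# creating tiktok heatmap
sns.set_theme(rc = {'figure.figsize':(13,13)})
c_map = sns.diverging_palette(220, 20, as_cmap=True)
# numeric_only=True skips text columns such as track_name (needed on pandas >= 2.0)
sns.heatmap(tiktok_data.corr(numeric_only=True), cmap=c_map, annot=True)
plt.show()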

At first glance, the attributes of TikTok's songs have low correlations with one another for the most part. The strongest positive correlation is between loudness and energy (0.65), whereas the strongest negative one is between acousticness and energy (-0.48). This potentially means that the louder the song, the more energy it has, and the more acoustic the song, the less. Another notable relationship is danceability and valence, with a correlation of 0.41.

Looking at the relationships with popularity, energy has the strongest correlation with this trait and tempo has the weakest. Such a finding suggests that, whatever the tempo, the more energetic (and therefore louder) a song is, the more popular it could be on TikTok. This makes sense because dance videos make up a huge portion of this app, as this was the app's initial intent.
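
Reading the extremes off a heatmap by eye can be error-prone, so here is a rough sketch of how the pairs could be ranked in code instead. `ranked_pairs` and the `toy` dataframe are hypothetical; the values are made up and are not the TikTok data.

```python
import numpy as np
import pandas as pd

def ranked_pairs(df: pd.DataFrame) -> pd.Series:
    """List each pair of numeric columns once, sorted by correlation."""
    corr = df.corr(numeric_only=True)
    # Keep only the cells above the diagonal so each pair appears once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    return corr.where(upper).stack().dropna().sort_values(ascending=False)

# Hypothetical toy values, chosen so loudness and energy rise together.
toy = pd.DataFrame({
    "loudness": [-9.0, -7.0, -5.0, -3.0],
    "energy": [0.40, 0.55, 0.70, 0.90],
    "acousticness": [0.80, 0.50, 0.30, 0.10],
})

pairs = ranked_pairs(toy)
print(pairs.head(1))  # most positive pair
print(pairs.tail(1))  # most negative pair
```

The top of the series holds the most positive pair and the bottom the most negative, while the values nearest zero in the middle are the truly weak relationships.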

In [629]:
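# creating spotify heatmap
sns.set_theme(rc = {'figure.figsize':(13,13)})
c_map = sns.diverging_palette(220, 20, as_cmap=True)
# numeric_only=True skips text columns such as track_name (needed on pandas >= 2.0)
sns.heatmap(spotify_data.corr(numeric_only=True), cmap=c_map, annot=True)
plt.show()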

The strongest positive correlation here is between loudness and energy (0.75), while the strongest negative one is between acousticness and energy (-0.68), the same pairs we observed in the TikTok heatmap. Valence also has a substantial relationship with both energy and loudness. This could potentially mean that the happier the song, the more energetic and loud it is.

Focusing on the popularity column, we can see that mode has the highest correlation and danceability has the lowest. The mode in this dataset represents the major or minor key, where zero is minor and one is major. These two variables may affect the popularity of a song on Spotify the most, and the correlations suggest that a song with lower danceability in a major key may be more popular than a song with high danceability in a minor key.

TikTok vs. Spotify

So many attributes, so little time... So I decided to focus on the ones that seem most relevant according to the previous heatmaps: energy, loudness, danceability, and acousticness. Since the dataset did not explain how these values were calculated, the x-axis will not show units. However, all four attributes follow the same rule: the higher the value, the more [insert attribute] the song is.

In [630]:
# plot histogram for TikTok and Spotify: energy
fig, ax = plt.subplots();

# bars for energy ᕦ(ò_óˇ)
a_heights, a_bins = np.histogram(tiktok_data['energy']);
b_heights, b_bins = np.histogram(spotify_data['energy'], bins=a_bins);

width = (a_bins[1] - a_bins[0])/3

# prettify the graph
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue', label='TikTok');
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='plum', label='Spotify');
fig.suptitle('Energy', fontsize=20);
plt.ylabel('Number of Songs', fontsize=16);
plt.xlabel('Level of Energy', fontsize=16);
ax.legend();

It appears that most trending songs on TikTok do not have high energy, while songs from Spotify have average to high energy. This is mildly surprising because, as mentioned earlier, dance videos make up a huge portion of videos on TikTok. This leads me to believe that dance videos do not necessarily need high-energy songs.
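
One caveat about the histogram cells: np.histogram ignores values that fall outside the bin edges it is given, so reusing TikTok's edges for Spotify could silently leave out any Spotify song beyond TikTok's range. Below is a minimal sketch of one way around that, computing the edges from both samples together. `shared_bin_counts` and the toy arrays are hypothetical and are not part of the original cells.

```python
import numpy as np

def shared_bin_counts(a, b, bins=10):
    """Histogram two samples on bin edges that cover both of them."""
    edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=bins)
    a_counts, _ = np.histogram(a, bins=edges)
    b_counts, _ = np.histogram(b, bins=edges)
    return edges, a_counts, b_counts

# Hypothetical values: the second sample extends past the first one's range.
tiktok_like = np.array([0.2, 0.3, 0.4, 0.5])
spotify_like = np.array([0.3, 0.6, 0.9])

edges, t_counts, s_counts = shared_bin_counts(tiktok_like, spotify_like, bins=4)
print(t_counts.sum(), s_counts.sum())  # every value is counted

# Reusing only the first sample's edges would drop 0.6 and 0.9.
_, first_edges = np.histogram(tiktok_like, bins=4)
dropped, _ = np.histogram(spotify_like, bins=first_edges)
print(dropped.sum())
```

The returned edges and counts can be passed to `ax.bar` the same way `a_bins` and `a_heights` are used in the cells here.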

In [631]:
# plot histogram for TikTok and Spotify: loudness
fig, ax = plt.subplots();

# bars for loud
a_heights, a_bins = np.histogram(tiktok_data['loudness']);
b_heights, b_bins = np.histogram(spotify_data['loudness'], bins=a_bins);

width = (a_bins[1] - a_bins[0])/3

# prettify the graph
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue', label='TikTok');
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='plum', label='Spotify');
fig.suptitle('Loudness', fontsize=20);
plt.ylabel('Number of Songs', fontsize=16);
plt.xlabel('Level of Loudness', fontsize=16);
ax.legend();

Most songs from TikTok are loud, as the distribution is left-skewed. For Spotify, however, the distribution looks close to a bell curve, meaning most of the songs are moderately loud, with few songs that are too loud or too soft.

Hmmmmmm

From the heatmap, loudness and energy are strongly correlated. But how strong? Perhaps a scatter plot can help with this, and we can even find the equation for the line of best fit to help us predict the loudness or energy of a song in the future.

In [632]:
# plot points and line for TikTok (x = loudness, y = energy)
x = tiktok_data['loudness']
y = tiktok_data['energy']
plt.scatter(x, y, c='b', label='TikTok')
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x+b, c='b')
eq1 = "TikTok Line of Best Fit: y[energy] = "+str(m)+" * x[loudness] + "+str(b)

# plot points and line for Spotify (x = loudness, y = energy)
x = spotify_data['loudness']
y = spotify_data['energy']
plt.scatter(x, y, c='purple', label='Spotify')
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x+b, c='purple')
eq2 = "Spotify Line of Best Fit: y[energy] = "+str(m)+" * x[loudness] + "+str(b)

# set titles and labels (axis labels match the x and y used above)
plt.title('Energy vs. Loudness', fontsize=16);
plt.ylabel('Energy', fontsize=16);
plt.xlabel('Loudness', fontsize=16);
plt.legend(loc='upper left')
plt.show()
In [633]:
print(eq1)
print(eq2)
TikTok Line of Best Fit: y[energy] = 0.03922673705836067 * x[loudness] + 0.8775505849346418
Spotify Line of Best Fit: y[energy] = 0.054131218234488826 * x[loudness] + 0.9659463819589633
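
Because the cell fits energy (y) against loudness (x), the line predicts a song's energy from its loudness. With the TikTok coefficients printed above, a song with a loudness of -5 would land at roughly 0.039 × (-5) + 0.878 ≈ 0.68. The sketch below shows the same fit-then-predict steps on made-up points that sit exactly on a known line; the arrays and the 0.04 and 0.9 coefficients are hypothetical.

```python
import numpy as np

# Hypothetical points that lie exactly on energy = 0.04 * loudness + 0.9.
loudness = np.array([-10.0, -8.0, -6.0, -4.0])
energy = 0.04 * loudness + 0.9

# A degree-1 polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(loudness, energy, 1)

# Predict the energy of a song with a loudness of -5 using the fitted line.
predicted = np.polyval([slope, intercept], -5.0)
print(round(float(predicted), 3))
```
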
In [634]:
# plot histogram for TikTok and Spotify: danceability ヽ(⌐■_■)ノ♪♬
fig, ax = plt.subplots();

# bars for dance
a_heights, a_bins = np.histogram(tiktok_data['danceability']);
b_heights, b_bins = np.histogram(spotify_data['danceability'], bins=a_bins);

width = (a_bins[1] - a_bins[0])/3

# prettify the graph
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue', label='TikTok');
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='plum', label='Spotify');
fig.suptitle('Danceability', fontsize=20);
plt.ylabel('Number of Songs', fontsize=16);
plt.xlabel('Level of Danceability', fontsize=16);
ax.legend();

The distribution for TikTok is once again left-skewed, meaning most songs are danceable. This is no surprise, considering that most songs from this app are also loud and we previously concluded that danceability positively correlates with loudness. The Spotify songs seem to be on the higher side of danceability as well, suggesting that more danceable songs are more likely to trend.

In [635]:
# plot histogram for TikTok and Spotify: acousticness
fig, ax = plt.subplots();

# bars for acousticness
a_heights, a_bins = np.histogram(tiktok_data['acousticness']);
b_heights, b_bins = np.histogram(spotify_data['acousticness'], bins=a_bins);

width = (a_bins[1] - a_bins[0])/3

# prettify the graph
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue', label='TikTok');
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='plum', label='Spotify');
fig.suptitle('Acousticness', fontsize=20);
plt.ylabel('Number of Songs', fontsize=16);
plt.xlabel('Level of Acousticness', fontsize=16);
ax.legend();

Lastly, I examined acousticness, the attribute behind the most negative correlation in both heatmaps. TikTok and Spotify seem to have similar distributions, with most songs having low acousticness. It appears that not only Gen Z, but all generations these days prefer more upbeat, danceable songs!

Does Time Matter?

Up to this point, we have been neglecting two very important components of music - tempo and time signature! Do we enjoy faster or slower songs nowadays? How about songs that sound "even" or "odd"? These attributes will help answer these questions.

In [636]:
# creating density plots
tiktok_data["tempo"].plot.kde(bw_method=0.15, c = 'blue', label = 'TikTok');
spotify_data["tempo"].plot.kde(bw_method=0.15, c = 'purple', label = 'Spotify');

# labels and title
plt.title('Tempo', fontsize=16);
plt.ylabel('Density', fontsize=16);
plt.xlabel('Tempo (Beats per Minute)', fontsize=16);
plt.legend(loc='upper left');

Given this density plot, most songs seem to lie within the range of 80 to 160 bpm, with a spike at around 170 bpm. The densities of the two apps are similar, with most maximums and minimums at the same places. However, TikTok's density exceeds Spotify's for the most part, which indicates that Spotify's tempos are more spread out and cover a wider range than TikTok's. In addition, both apps' absolute maximum is around 130 bpm, with a local maximum at around 165 bpm.

In [637]:
# print stats
print("TikTok's Summary Statistics on Tempo:\n",tiktok_data["tempo"].describe())
print("Spotify's Summary Statistics on Tempo:\n",spotify_data["tempo"].describe())
TikTok's Summary Statistics on Tempo:
 count     50.000000
mean     119.439340
std       22.949968
min       71.994000
25%      100.787750
50%      119.934500
75%      132.315000
max      171.020000
Name: tempo, dtype: float64
Spotify's Summary Statistics on Tempo:
 count     50.000000
mean     121.083860
std       29.252206
min       72.017000
25%       98.655500
50%      120.516500
75%      138.532000
max      180.917000
Name: tempo, dtype: float64

I have printed the statistics for both apps to summarize and decipher our findings.

Mean: For both apps, the average tempo is around 120, so we like songs that aren't too fast or too slow.

Stdev: As expected, Spotify's standard deviation is greater than TikTok's (29.25 versus 22.95), indicating more spread-out tempo data

Q1 (25%) & Q3 (75%): Spotify's quartiles sit further apart than TikTok's (an interquartile range of about 39.9 bpm versus 31.5 bpm), as a result of the bigger spread

Median (50%): For both apps, the median is almost identical to the mean, which suggests the tempo data for both is roughly symmetrical.

In [638]:
# print stats for time_signature (Spotify)
print("Spotify's Summary Statistics on Time Signature:\n",spotify_data["time_signature"].describe())
Spotify's Summary Statistics on Time Signature:
 count    50.000000
mean      3.960000
std       0.197949
min       3.000000
25%       4.000000
50%       4.000000
75%       4.000000
max       4.000000
Name: time_signature, dtype: float64

Unfortunately, TikTok's dataset did not come with time signature. No worries though, we can still see what people prefer on Spotify - and from the printed statistics, it seems like almost all the trending songs are in 4/4 time. We know this because even at the 25th percentile, the time signature is 4/4. However, there are still one or two songs in 3/4, as indicated by the minimum. We like our songs to feel "even"!

Popularity Contest
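
The summary statistics alone are enough to estimate how many songs are in 3/4. As a rough check, assuming every value is either 3 or 4, a mean of 3.96 over 50 songs means the values sum to 198, which is 2 short of 50 × 4, so two songs would be in 3/4. The sketch below rebuilds such a column (a hypothetical reconstruction, not the real data) so its mean and standard deviation can be compared with the printed ones.

```python
import pandas as pd

# Hypothetical reconstruction: 48 songs in 4/4 and 2 songs in 3/4.
time_signature = pd.Series([4] * 48 + [3] * 2)

summary = time_signature.describe()
print(summary["mean"], round(summary["std"], 6))
```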

Are there any particular artists that we like nowadays? Let's see who has more than just one trending song in the top 50. To help us visualize this better, I created separate dataframes to obtain the artists with the most songs on both apps.

In [639]:
# artists with most songs in TikTok
tiktok_data['Count']=1
tiktok_artist = tiktok_data.groupby('artist_name')['Count'].sum().reset_index().sort_values(by='Count',ascending=False)
tiktok_top_ten = tiktok_artist.head(10)

# artists with most songs in Spotify
spotify_data['Count']=1
spotify_artist = spotify_data.groupby('artist_name')['Count'].sum().reset_index().sort_values(by='Count',ascending=False)
spotify_top_ten = spotify_artist.head(10)
print("Left: TikTok | Right: Spotify")
pd.concat([d.reset_index(drop=True) for d in [tiktok_top_ten, spotify_top_ten]], axis=1)
Left: TikTok | Right: Spotify
Out[639]:
   artist_name          Count  artist_name     Count
0  Megan Thee Stallion  2      Doja Cat        4
1  Cardi B              2      Olivia Rodrigo  4
2  Doja Cat             2      Bad Bunny       3
3  Pop Smoke            2      Lil Nas X       2
4  24kGoldn             1      BTS             2
5  Ritt Momney          1      The Weeknd      2
6  Lil Vinceyy          1      Dua Lipa        2
7  Mike Posner          1      The Kid LAROI   2
8  Monte Booker         1      Ariana Grande   2
9  Nelly Furtado        1      Måneskin        2

For TikTok, the artists with the most songs on the list are Megan Thee Stallion, Cardi B, Doja Cat, and Pop Smoke. All four artists are rappers, which leads me to believe that rap songs are more likely to trend on TikTok. However, despite their popularity, they each have only 2 songs on the list, so there is more artist variety among the 50 songs.

As for Spotify, Doja Cat and Olivia Rodrigo won, with four songs each. The rest of the artists in the top ten list have 2 or 3 songs each. This means that 25 songs belong to artists within the top ten! That's half of the top 50 songs! Compared to TikTok, Spotify's list has less variety musically, as the same artists tend to hit the charts. Props to Doja Cat for making both lists though!

Trendsetter Gen Z?

For both TikTok and Spotify, most of the attributes we have explored are strikingly similar. Although we cannot confirm that TikTok songs have influenced the trending songs on Spotify, we can see whether the two lists share any songs.
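
The 25-song figure comes from adding up the Spotify counts in the table: 4 + 4 + 3 + 2 × 7 = 25. For anyone who wants to compute that kind of share directly, here is a minimal sketch using `value_counts`, which tallies songs per artist without needing a helper column. The `artists` series is a hypothetical stand-in, not the real chart.

```python
import pandas as pd

# Hypothetical stand-in for the artist_name column of a top-songs table.
artists = pd.Series([
    "Doja Cat", "Doja Cat", "Olivia Rodrigo", "Doja Cat", "BTS",
    "Olivia Rodrigo", "Dua Lipa", "Doja Cat", "Olivia Rodrigo", "The Weeknd",
])

# value_counts tallies songs per artist, most frequent first.
songs_per_artist = artists.value_counts()
top_two = songs_per_artist.head(2)

# Share of all songs that belong to the top artists.
share = top_two.sum() / len(artists)
print(top_two.to_dict(), f"{share:.0%}")
```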

In [640]:
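# creating a dataframe that has common songs (matched on title only)
common_songs = pd.merge(tiktok_data, spotify_data, on='track_name', how='inner')

# keep only the title and the artist as listed in the TikTok data
common_songs = common_songs[['track_name','artist_name_x']]

display(common_songs)
# compute the share from the data instead of hardcoding 8/50
print(str(len(common_songs)/len(spotify_data)*100), "% songs in common")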
   track_name                              artist_name_x
0  Don't Start Now                         Dua Lipa
1  drivers license                         Olivia Rodrigo
2  Mood (feat. iann dior)                  24kGoldn
3  Peaches (feat. Daniel Caesar & Giveon)  Justin Bieber
4  34+35                                   Ariana Grande
5  Heartbreak Anniversary                  Giveon
6  Blinding Lights                         DJ Challenge X
7  Kiss Me More (feat. SZA)                Doja Cat
16.0 % songs in common

Out of the top 50 songs for each app, there are 8 songs in common, not bad! We cannot draw any conclusions, however, about Gen Z influencing the music industry, as popular songs on TikTok could have been influenced by current trending songs as well.
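
One thing to keep in mind is that the merge matches on the song title only, so a track such as "Blinding Lights" can match even when the TikTok entry lists a different artist. Whether that counts as the same song is a judgment call. As a sketch of the stricter alternative, the example below merges on both title and artist; the two miniature tables are hypothetical.

```python
import pandas as pd

# Hypothetical miniature versions of the two top-song tables.
tiktok = pd.DataFrame({
    "track_name": ["Blinding Lights", "drivers license", "No Idea"],
    "artist_name": ["DJ Challenge X", "Olivia Rodrigo", "Don Toliver"],
})
spotify = pd.DataFrame({
    "track_name": ["Blinding Lights", "drivers license", "good 4 u"],
    "artist_name": ["The Weeknd", "Olivia Rodrigo", "Olivia Rodrigo"],
})

# Matching on the title alone pairs up same-named tracks by different artists.
by_title = pd.merge(tiktok, spotify, on="track_name", how="inner")

# Matching on title and artist keeps only exact song matches.
by_song = pd.merge(tiktok, spotify, on=["track_name", "artist_name"], how="inner")

print(len(by_title), len(by_song))
print(f"{len(by_song) / len(spotify):.1%} songs in common")
```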

Conclusion

As we analyzed and compared the trending songs from TikTok and Spotify, it is safe to generalize a few things about trending songs in this generation:

  • The louder the song, the more energetic. Conversely, the more acoustic, the less energetic.
  • We tend to be attracted to more upbeat and danceable songs.
  • However, Spotify's trending songs tend to be more energetic than TikTok's, while TikTok's skew louder.
  • Society as a whole is not a fan of acoustic songs.
  • No matter the genre, we like our songs at a "perfect" pace of around 120 bpm.
  • TikTok, or most of Gen Z, really enjoys rap artists, with our favorite being Doja Cat.
  • We enjoy songs in a comfortable count of 4/4 time.
  • TikTok's trending songs have more variety.

Through this analysis, I have learned much more about the songs in the TikTok subculture, as well as trending songs on Spotify. I now have a deeper understanding of why we prefer certain songs over others, and what attributes should be emphasized if I want to make a hit song one day...

Thank you for exploring with me, and I hope you learned something from this adventure as well.