CS 4641 Project

Midterm Report

Introduction

In today’s world, the average consumer has access to millions of songs at the simple click of a button, a catalog that includes songs that span across eras, artistic expressions, and themes. While this presents music enthusiasts with an exciting opportunity to get more involved with their work, it also presents many challenges when it comes to organizing, classifying, and recommending songs to individuals with varied tastes.

Our team is impressed by the amount of research that has been done in this area, and wants to add to this literature. We have located several similar studies that each test different models, such as Convolutional Neural Networks (CNNs) [1] and Support Vector Machines [2]. We hope to compare and contrast more types of models to create a better genre classification model.

We plan to accomplish this by analyzing a dataset containing 1700 spectrograms derived from songs that are 270-300 seconds in length, and sorted into a hierarchy containing 3 levels of classifications and 16 distinct genres. Due to limitations in the dataset, our research will mainly be limited to English songs.

Problem Definition & Motivation

This project aims to classify the genres of songs based on their spectrograms. Using image processing techniques to redefine visual data into numeric data. and extract useful features to classify the music genre. We have three subtasks for classification:

First-level classification: Broadly distinguishes between classical and non-classical music.
Second-level classification: specific genres within the first level, such as classifying a piece as a symphony within the classical genre or as pop within the non-classical genre.
Third-level classification: Provides finer subcategories such as Teen_pop or pop_vocal_ballads in pop Categories.

In developing these classifications, we hope this project can lead into helping to identify trends within genres, which would provide valuable insights for artists, labels, and consumers.

Data Collection

Our team utilized the pre-existing dataset consisting of spectrogram images. Our dataset can be found here: https://huggingface.co/datasets/ccmusic-database/music_genre. We refined this dataset with a feature selection algorithm.

The original dataset consists of 1,700 spectrogram images from songs lasting between 270 and 300 seconds. These images, sized at 349x476 pixels, are in JPEG format and represent the spectrograms of music files. Created using Fast Fourier Transform (FFT), these spectrograms map the sound frequencies and intensities over time.

In each spectrogram:

The X-axis indicates time.
The Y-axis shows frequency.
The brightness of each pixel reflects the intensity of sound waves at specific frequencies and times.

Human voices and musical instruments produce unique waveforms. In a spectrogram, these waveforms are decomposed into various frequency sound waves, forming distinct patterns along the Y-axis. The presence and duration of each instrument or voice in the music are captured along the X-axis.

The dataset is categorized into three hierarchical levels:

First-level classification: Broadly distinguishes between classical and non-classical music.
Second-level classification: specific genres within the first level, such as classifying a piece as a symphony within the classical genre or as pop within the non-classical genre.
Third-level classification: Provides finer subcategories such as Teen_pop or pop_vocal_ballads in pop Categories.

First we transformed the image file into gray scale since color does not have any information but intensity does. After that we transformed to Gray scale co-occurance matrix(GLCM)[5] from the image to select contrast, dissimilarity, homogeneity, energy and correlation features.

Data Preprocessing (Feature Creation & Selection):

Given that we had image data, we needed to convert it into a format that works with traditional Machine Learning algorithms like K-Means. In order to do this, we used PIL and Numpy to convert the image into an array of floats. We then extracted several features from this array of floats, including the mean value, contrast, dissimilarity, homogenity, energy, correlation.

Methods

The methods used are Principal Component Analysis (PCA), and the unsupervised learning model we used was K-Means clustering. Using K-Means helps identify patterns within the dataset, which will help us find similarities and differences across genres. We used this method because it efficiently clusters large datasets, and works with grouping similar spectrograms. In this project we used PCA for visualization purposes, rather than feature selection. By reducing the high dimensional spectrogram data to three dimensions. PCA enables the effective visual representation of the data. This helps in our understanding of the underlying structures and relationships. PCA works by transforming the data into a smaller number of uncorrelated variables or principal components. The combination of PCA and K-Means helps us uncover patterns within the music genres, which can be used in the future for adding music recommendation systems, trend analysis, etc.

Quantitative Metrics

As can be seen in Figure 1 below, the purity scores for the eight clusters were lackluster, with four of the clusters having scores near 0.2 and all overall average purity of just above 0.25. We can note silhouette scores for varying number of clusters: Scores

We see the best clustering performance when dividing into two clusters (ideally classical and non-classical), but even with eight clusters there is still a relatively poor performance.

Analysis

As can be seen in Figure 2 down below (ground truth), we see some distinctions between the classical genres (symphony, opera, solo, and chamber) and the non-classical (pop, dance & house, soul/R&B, and rock), with the former occupying mostly the closer bottom-right octant of the graph in the displayed orientation. The latter genres seem to have a much larger spread, which is not surprising given the larger range in musical techniques and styles among these genres when compared to classical music. This visualization was generated using PCA which reduced the total number of features down to three from over 700.

As discussed above in the quantitative metrics section, we saw that the clustering generally performed pretty poorly, with low purity scores and a low silhouette score even for just two clusters. Unsurprisingly, we do not see any trends within the clusters as in Figure 3 there is no semblance of grouping of the clusters that reflects the distinctions we noted above. As we will discuss in the next section, there is significant room for growth in our feature selection, which will hopefully help with both the supervised learning methods and also improving our clustering.

Visualizations

Purity Ground Truth KMeans

Future Work

There is still significant room for improvement to make this project a viable genre classifier. Going forward, we plan to focus our work in two main areas: improving our feature selection process and improving our models.

Right now, our feature selection is quite limited, and does not do a great job of separating the data into distinct genre classification. Based on our current research, we have decided to experiment with using histograms of oriented gradients (HoG)[6] for our classification. This is a common feature reduction process used in image recognition, which we hope will translate well to the spectrogram images included in the dataset. However, this will also greatly increase complexity compared to our current features, so we may need to use PCA[7] to reduce the dimensions of the HoG features further.

Furthermore, we are going to branch into using supervised learning techniques for the second half of this project. These techniques will take advantage of the fact that our data is already labeled, and use these labels to better learn how to classify the songs. So far, we have identified five different models we would like to experiment with: logistic regression, support vector machines (SVMs), random forests, gradient boosting, and convolutional neural networks (CNNs). Each of these should give dramatically better results than our current clustering techniques.

References

N. M R and S. Mohan B S, “Music Genre Classification using Spectrograms,” 2020 International Conference on Power, Instrumentation, Control and Computing (PICC), Thrissur, India, 2020, pp. 1-5, doi: 10.1109/PICC51425.2020.9362364.
Costa, Yandre & Soares de Oliveira, Luiz & Koericb, A.L. & Gouyon, Fabien. (2011). Music genre recognition using spectrograms. Intl. Conf. on Systems, Signal and Image Processing. 1 - 4.
M. Dong, ‘Convolutional Neural Network Achieves Human-level Accuracy in Music Genre Classification’, CoRR, vol. abs/1802.09697, 2018.
Zhaorui Liu and Zijin Li, “Music Data Sharing Platform for Computational Musicology Research (CCMUSIC DATASET).” Zenodo, Nov. 12, 2021. doi: 10.5281/ZENODO.5676893.
M. Hall-Beyer, 2007. GLCM Texture: A Tutorial https://prism.ucalgary.ca/handle/1880/51900 DOI:10.11575/PRISM/33280
V. Bisot, S. Essid and G. Richard, “HOG and subband power distribution image features for acoustic scene classification,” 2015 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 2015, pp. 719-723, doi: 10.1109/EUSIPCO.2015.7362477.
Y. Panagakis, C. Kotropoulos and G. R. Arce, “Non-Negative Multilinear Principal Component Analysis of Auditory Temporal Modulations for Music Genre Classification,” in IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 576-588, March 2010, doi: 10.1109/TASL.2009.2036813.

Proposed Timeline

This timeline is subject to change. A more detailed version can be found here.

TASK TITLE	TASK OWNER	START DATE	DUE DATE
Project Proposal
Introduction & Background	James DiPrimo	9/27/2023	10/6/2023
Problem Definition	Anirudh Ramesh	9/27/2023	10/6/2023
Methods	Siddhant Dubey	9/27/2023	10/6/2023
Timeline	Soongeol Kang	9/27/2023	10/6/2023
Potential Results & Discussion	Joseph Campbell	9/27/2023	10/6/2023
Video Recording	Siddhant Dubey	9/27/2023	10/6/2023
GitHub Page	Siddhant Dubey	9/27/2023	10/6/2023
Model 1 (GMM or K-means)
Data Sourcing and Cleaning	Siddhant Dubey and Anirudh Ramesh	10/7/2023	10/13/2023
Model Selection	Everyone	10/13/2023	10/16/2023
Data Pre-Processing	Siddhant Dubey and Anirudh Ramesh	10/16/2023	10/23/2023
Model Coding	Siddhant Dubey and Anirudh Ramesh	10/23/2023	10/30/2023
Visualizations	James	10/30/2023	11/2/2023
Quantitative Matrics, Analysis (Results Evaluation)	Joseph	10/30/2023	11/2/2023
Describe data set, revise, update timeline/contribution table (Mid report)	Kang	10/31/2023	11/3/2023
Midterm Report	Everyone	10/31/2023	11/3/2023
Model 2 (CNN)
Model Coding	Siddhant Dubey and Anirudh Ramesh	10/28/2023	11/4/2023
Results Evaluation	Everyone	11/5/2023	11/8/2023
Visualizations	James	11/5/2023	11/8/2023
Quantitative Matrics, Analysis (Results Evaluation)	Joseph	11/5/2023	11/8/2023
Describe data set, revise, update timeline/contribution table (Mid report)	Kang	11/6/2023	11/9/2023
Analysis	Joseph	11/6/2023	11/9/2023
Model 3 (SVMs)
Midterm Report	Everyone	11/3/2023	11/11/2023
Model Coding	Soongeol Kang, James DiPrimo	11/11/2023	11/18/2023
Results Evaluation	Anirudh Ramesh	11/18/2023	11/21/2023
Analysis	Joseph Campbell	11/19/2023	11/22/2023
Model 4 (Random Forests)
Model Coding	James DiPrimo, Anirudh Ramesh	11/15/2023	11/22/2023
Results Evaluation	Siddhant Dubey	11/20/2023	11/23/2023
Analysis	Soongeol Kang	11/21/2023	11/24/2023
Evaluation
Model Comparison	Everyone	11/29/2021	12/4/2021
Presentation	Everyone	12/1/2023	12/6/2023
Recording	Everyone	12/6/2021	12/7/2021
Final Report	Everyone	12/2/2021	12/8/2021

Contribution Table

Contribution	Person
Introduction	James
Problem Statement	Anirudh
Methods	Siddhant
Potential Results	Joseph
Proposed Timeline	Soongeol
Finding Datasets	Everyone
Finding Papers	Everyone

Contribution 2	Person
pick data pre processing method	Anirudh, Siddhant
implement algorithm	Anirudh, Siddhant
quantitative metrics	Joseph
analysis of algorithm	Joseph
visualizations	James
Next steps	James
describe data set	Soongeol
revise references, problem motivation and identification	Soongeol
update timeline/contribution table	Soongeol
results	Everyone