CS 4641 Project

Midterm Report

Introduction

In today’s world, the average consumer has access to millions of songs at the click of a button, a catalog that spans eras, artistic expressions, and themes. While this gives music enthusiasts an exciting opportunity to engage more deeply with the music they love, it also creates real challenges in organizing, classifying, and recommending songs to individuals with varied tastes.

Our team is impressed by the amount of research that has been done in this area and wants to add to this literature. We have located several similar studies, each testing different models such as Convolutional Neural Networks (CNNs) [1] and Support Vector Machines (SVMs) [2]. We hope to compare and contrast additional model types to build a better genre classifier.

We plan to accomplish this by analyzing a dataset of 1,700 spectrograms derived from songs 270-300 seconds in length, sorted into a hierarchy of 3 classification levels and 16 distinct genres. Because of the dataset's composition, our research is mainly limited to English-language songs.

Problem Definition & Motivation

This project aims to classify the genres of songs based on their spectrograms. We use image processing techniques to convert the visual data into numeric data and extract useful features for classifying the music genre. We have three classification subtasks:

  1. First-level classification: broadly distinguishes between classical and non-classical music.
  2. Second-level classification: identifies specific genres within the first level, such as classifying a piece as a symphony within the classical category or as pop within the non-classical category.
  3. Third-level classification: provides finer subcategories, such as Teen_pop or pop_vocal_ballads within the pop category.

In developing these classifications, we hope this project can help identify trends within genres, providing valuable insights for artists, labels, and consumers.

Data Collection

Our team utilized a pre-existing dataset of spectrogram images, available here: https://huggingface.co/datasets/ccmusic-database/music_genre. We refined this dataset with a feature selection algorithm.

The original dataset consists of 1,700 spectrogram images from songs lasting between 270 and 300 seconds. These images, sized at 349x476 pixels, are in JPEG format and represent the spectrograms of music files. Created using Fast Fourier Transform (FFT), these spectrograms map the sound frequencies and intensities over time.

In each spectrogram, human voices and musical instruments produce unique waveforms. These waveforms are decomposed into their constituent frequencies, forming distinct patterns along the Y-axis, while the presence and duration of each instrument or voice are captured along the X-axis.
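The dataset ships these images pre-rendered, but for intuition, the sketch below shows how a comparable spectrogram could be produced from raw audio with SciPy's short-time FFT. The file name "song.wav" and the window size are illustrative assumptions, not details of the dataset's actual pipeline.

```python
# For intuition only: the dataset ships pre-rendered spectrograms, but
# an image of this kind could be produced from raw audio roughly as
# follows, assuming SciPy; "song.wav" is a placeholder file.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, audio = wavfile.read("song.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # mix stereo down to mono

# Short-time FFT: frequency on the Y-axis, time on the X-axis.
f, t, Sxx = spectrogram(audio, fs=rate, nperseg=1024)

plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-10))  # intensity in dB
plt.ylabel("Frequency (Hz)")
plt.xlabel("Time (s)")
plt.savefig("spectrogram.jpg")
```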

The dataset is categorized into the same three hierarchical levels described in the problem definition above.

First, we converted each image to grayscale, since color carries no additional information beyond intensity. We then computed a gray-level co-occurrence matrix (GLCM) [5] from each image and extracted contrast, dissimilarity, homogeneity, energy, and correlation features.

Data Preprocessing (Feature Creation & Selection)

Given that we had image data, we needed to convert it into a format compatible with traditional machine learning algorithms like K-Means. To do this, we used PIL and NumPy to convert each image into an array of floats. We then extracted several features from this array, including the mean intensity and the GLCM contrast, dissimilarity, homogeneity, energy, and correlation.
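Below is a minimal sketch of this preprocessing step, assuming scikit-image for the GLCM computation; the file name and the distance/angle choices are illustrative rather than our final settings.

```python
# Minimal sketch of our grayscale + GLCM feature extraction, assuming
# scikit-image; the file name and the distance/angle choices are
# illustrative.
import numpy as np
from PIL import Image
from skimage.feature import graycomatrix, graycoprops

PROPS = ["contrast", "dissimilarity", "homogeneity", "energy", "correlation"]

def extract_features(path):
    # Convert the JPEG spectrogram to a grayscale intensity array (uint8).
    gray = np.array(Image.open(path).convert("L"))

    # Co-occurrence of intensity levels over horizontal neighbor pairs at
    # distance 1; other distances/angles are equally valid choices.
    glcm = graycomatrix(gray, distances=[1], angles=[0],
                        levels=256, symmetric=True, normed=True)

    # Mean intensity plus the five GLCM texture properties.
    feats = [gray.astype(float).mean()]
    feats += [graycoprops(glcm, p)[0, 0] for p in PROPS]
    return np.array(feats)

features = extract_features("spectrogram.jpg")
```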

Methods

Our dimensionality reduction method was Principal Component Analysis (PCA), and the unsupervised learning model we used was K-Means clustering. K-Means helps identify patterns within the dataset, revealing similarities and differences across genres. We chose it because it efficiently clusters large datasets and is well suited to grouping similar spectrograms. In this project we used PCA for visualization rather than feature selection: by reducing the high-dimensional spectrogram features to three dimensions, PCA enables an effective visual representation of the data and aids our understanding of its underlying structure. PCA works by transforming the data into a smaller number of uncorrelated variables, the principal components. Together, PCA and K-Means help us uncover patterns within the music genres, which could later support music recommendation systems, trend analysis, and more.
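The sketch below shows the shape of this pipeline with scikit-learn. The random matrix is a placeholder for our real feature matrix, and standardizing the features first is our assumption rather than a step described above.

```python
# Sketch of the K-Means + PCA pipeline with scikit-learn; the random
# matrix is a placeholder for our real feature matrix, and standardizing
# first is an assumption rather than a documented step.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((1700, 6))  # placeholder features

X_std = StandardScaler().fit_transform(X)

# K-Means with k=8, mirroring the eight second-level genres.
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X_std)

# PCA to three components, used purely for visualization.
X_3d = PCA(n_components=3).fit_transform(X_std)
```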

Quantitative Metrics

As can be seen in Figure 1 below, the purity scores for the eight clusters were lackluster: four of the clusters had scores near 0.2, and the overall average purity was just above 0.25. We also computed silhouette scores for varying numbers of clusters.
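For reference, purity credits each cluster with its most common ground-truth genre; a minimal sketch of the computation follows, assuming scikit-learn, with random labels standing in for our real ones.

```python
# Hedged sketch of the cluster-purity metric we report; the random
# labels below are placeholders for real genre labels and K-Means output.
import numpy as np
from sklearn.metrics import confusion_matrix

def purity(y_true, y_pred):
    # Rows are ground-truth genres, columns are clusters; each cluster is
    # credited with its most common genre.
    cm = confusion_matrix(y_true, y_pred)
    return cm.max(axis=0).sum() / cm.sum()

rng = np.random.default_rng(0)
print(purity(rng.integers(0, 8, 100), rng.integers(0, 8, 100)))
```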

We see the best clustering performance when dividing into two clusters (ideally classical and non-classical), but performance remains relatively poor even at eight clusters.
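A sweep like the following is one way to produce that comparison across cluster counts, assuming scikit-learn; `X_std` is a placeholder for our standardized feature matrix.

```python
# Silhouette comparison across cluster counts; X_std is a placeholder
# for the standardized feature matrix built earlier.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_std = np.random.default_rng(0).random((1700, 6))  # placeholder features

for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_std)
    print(f"k={k}: silhouette={silhouette_score(X_std, labels):.3f}")
```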

Analysis

As can be seen in Figure 2 below (ground truth), there are some distinctions between the classical genres (symphony, opera, solo, and chamber) and the non-classical genres (pop, dance & house, soul/R&B, and rock), with the former mostly occupying the near bottom-right octant of the graph in the displayed orientation. The non-classical genres show a much larger spread, which is unsurprising given the wider range of musical techniques and styles among those genres compared to classical music. This visualization was generated using PCA, which reduced the total number of features from over 700 down to three.
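Figure 2 itself is a static image, but a plot of this kind can be recreated from the three PCA components with matplotlib, roughly as follows; the random arrays are placeholders for the real components and genre labels.

```python
# Rough sketch of a 3-D scatter like Figure 2, assuming matplotlib; the
# random arrays are placeholders for the PCA components and genre labels.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X_3d = rng.random((1700, 3))    # placeholder for PCA-reduced features
y = rng.integers(0, 8, 1700)    # placeholder for 8 second-level genres

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X_3d[:, 0], X_3d[:, 1], X_3d[:, 2], c=y, cmap="tab10", s=8)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
plt.show()
```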

As discussed in the quantitative metrics section above, the clustering generally performed poorly, with low purity scores and a low silhouette score even for just two clusters. Unsurprisingly, we do not see any trends within the clusters: Figure 3 shows no grouping that reflects the distinctions noted above. As we discuss in the next section, there is significant room for growth in our feature selection, which should help both our clustering and the upcoming supervised learning methods.

Visualizations

[Figure 1: cluster purity scores. Figure 2: ground-truth genres in PCA space. Figure 3: K-Means cluster assignments in PCA space.]

Future Work

There is still significant room for improvement to make this project a viable genre classifier. Going forward, we plan to focus our work in two main areas: improving our feature selection process and improving our models.

Right now, our feature selection is quite limited and does not do a great job of separating the data into distinct genre clusters. Based on our current research, we have decided to experiment with histograms of oriented gradients (HOG) [6] for our classification. This is a common feature extraction technique in image recognition, which we hope will translate well to the spectrogram images in the dataset. However, it will also greatly increase dimensionality compared to our current features, so we may need to use PCA [7] to further reduce the dimensions of the HOG features.
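As a starting point, HOG features can be extracted with scikit-image as sketched below; the file name and the cell/block sizes are illustrative guesses we still need to tune, not settled choices.

```python
# Exploratory sketch of HOG feature extraction with scikit-image; the
# file name and the cell/block sizes are illustrative assumptions.
import numpy as np
from PIL import Image
from skimage.feature import hog

gray = np.array(Image.open("spectrogram.jpg").convert("L"))

# One descriptor per image; with 349x476 inputs this is already several
# thousand dimensions, hence the planned PCA step afterwards.
hog_vec = hog(gray, orientations=9, pixels_per_cell=(16, 16),
              cells_per_block=(2, 2), feature_vector=True)
print(hog_vec.shape)
```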

Furthermore, we are going to branch into supervised learning techniques for the second half of this project. These techniques take advantage of the fact that our data is already labeled, using the labels to learn how to classify songs. So far, we have identified five models we would like to experiment with: logistic regression, support vector machines (SVMs), random forests, gradient boosting, and convolutional neural networks (CNNs). We expect each of these to give substantially better results than our current clustering techniques.
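As a preview of the planned workflow, a logistic regression baseline with scikit-learn might look like the sketch below; the random features and labels stand in for our real preprocessed dataset.

```python
# Hedged preview of the first supervised baseline we plan to run; the
# random features and labels are placeholders for the real dataset.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((1700, 6))       # placeholder feature matrix
y = rng.integers(0, 8, 1700)    # placeholder genre labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```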

References

  1. N. M R and S. Mohan B S, “Music Genre Classification using Spectrograms,” 2020 International Conference on Power, Instrumentation, Control and Computing (PICC), Thrissur, India, 2020, pp. 1-5, doi: 10.1109/PICC51425.2020.9362364.
  2. Y. Costa, L. Soares de Oliveira, A. L. Koerich, and F. Gouyon, “Music genre recognition using spectrograms,” Intl. Conf. on Systems, Signals and Image Processing, 2011, pp. 1-4.
  3. M. Dong, ‘Convolutional Neural Network Achieves Human-level Accuracy in Music Genre Classification’, CoRR, vol. abs/1802.09697, 2018.
  4. Zhaorui Liu and Zijin Li, “Music Data Sharing Platform for Computational Musicology Research (CCMUSIC DATASET).” Zenodo, Nov. 12, 2021. doi: 10.5281/ZENODO.5676893.
  5. M. Hall-Beyer, “GLCM Texture: A Tutorial,” 2007. https://prism.ucalgary.ca/handle/1880/51900, doi: 10.11575/PRISM/33280.
  6. V. Bisot, S. Essid and G. Richard, “HOG and subband power distribution image features for acoustic scene classification,” 2015 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 2015, pp. 719-723, doi: 10.1109/EUSIPCO.2015.7362477.
  7. Y. Panagakis, C. Kotropoulos and G. R. Arce, “Non-Negative Multilinear Principal Component Analysis of Auditory Temporal Modulations for Music Genre Classification,” in IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 576-588, March 2010, doi: 10.1109/TASL.2009.2036813.

Proposed Timeline

This timeline is subject to change. A more detailed version can be found here.

| TASK TITLE | TASK OWNER | START DATE | DUE DATE |
| --- | --- | --- | --- |
| Project Proposal | | | |
| Introduction & Background | James DiPrimo | 9/27/2023 | 10/6/2023 |
| Problem Definition | Anirudh Ramesh | 9/27/2023 | 10/6/2023 |
| Methods | Siddhant Dubey | 9/27/2023 | 10/6/2023 |
| Timeline | Soongeol Kang | 9/27/2023 | 10/6/2023 |
| Potential Results & Discussion | Joseph Campbell | 9/27/2023 | 10/6/2023 |
| Video Recording | Siddhant Dubey | 9/27/2023 | 10/6/2023 |
| GitHub Page | Siddhant Dubey | 9/27/2023 | 10/6/2023 |
| Model 1 (GMM or K-Means) | | | |
| Data Sourcing and Cleaning | Siddhant Dubey and Anirudh Ramesh | 10/7/2023 | 10/13/2023 |
| Model Selection | Everyone | 10/13/2023 | 10/16/2023 |
| Data Pre-Processing | Siddhant Dubey and Anirudh Ramesh | 10/16/2023 | 10/23/2023 |
| Model Coding | Siddhant Dubey and Anirudh Ramesh | 10/23/2023 | 10/30/2023 |
| Visualizations | James DiPrimo | 10/30/2023 | 11/2/2023 |
| Quantitative Metrics, Analysis (Results Evaluation) | Joseph Campbell | 10/30/2023 | 11/2/2023 |
| Describe dataset, revise, update timeline/contribution table (midterm report) | Soongeol Kang | 10/31/2023 | 11/3/2023 |
| Midterm Report | Everyone | 10/31/2023 | 11/3/2023 |
| Model 2 (CNN) | | | |
| Model Coding | Siddhant Dubey and Anirudh Ramesh | 10/28/2023 | 11/4/2023 |
| Results Evaluation | Everyone | 11/5/2023 | 11/8/2023 |
| Visualizations | James DiPrimo | 11/5/2023 | 11/8/2023 |
| Quantitative Metrics, Analysis (Results Evaluation) | Joseph Campbell | 11/5/2023 | 11/8/2023 |
| Describe dataset, revise, update timeline/contribution table (midterm report) | Soongeol Kang | 11/6/2023 | 11/9/2023 |
| Analysis | Joseph Campbell | 11/6/2023 | 11/9/2023 |
| Model 3 (SVMs) | | | |
| Midterm Report | Everyone | 11/3/2023 | 11/11/2023 |
| Model Coding | Soongeol Kang, James DiPrimo | 11/11/2023 | 11/18/2023 |
| Results Evaluation | Anirudh Ramesh | 11/18/2023 | 11/21/2023 |
| Analysis | Joseph Campbell | 11/19/2023 | 11/22/2023 |
| Model 4 (Random Forests) | | | |
| Model Coding | James DiPrimo, Anirudh Ramesh | 11/15/2023 | 11/22/2023 |
| Results Evaluation | Siddhant Dubey | 11/20/2023 | 11/23/2023 |
| Analysis | Soongeol Kang | 11/21/2023 | 11/24/2023 |
| Evaluation | | | |
| Model Comparison | Everyone | 11/29/2023 | 12/4/2023 |
| Presentation | Everyone | 12/1/2023 | 12/6/2023 |
| Recording | Everyone | 12/6/2023 | 12/7/2023 |
| Final Report | Everyone | 12/2/2023 | 12/8/2023 |

Contribution Table

| Contribution | Person |
| --- | --- |
| Introduction | James |
| Problem Statement | Anirudh |
| Methods | Siddhant |
| Potential Results | Joseph |
| Proposed Timeline | Soongeol |
| Finding Datasets | Everyone |
| Finding Papers | Everyone |

| Contribution 2 | Person |
| --- | --- |
| Pick data pre-processing method | Anirudh, Siddhant |
| Implement algorithm | Anirudh, Siddhant |
| Quantitative metrics | Joseph |
| Analysis of algorithm | Joseph |
| Visualizations | James |
| Next steps | James |
| Describe dataset | Soongeol |
| Revise references, problem motivation and identification | Soongeol |
| Update timeline/contribution table | Soongeol |
| Results | Everyone |