Acoustic Feature Learning

Overview

While there have been many advances in machine learning approaches for estimating human emotional response to music, very little progress has been made in terms of compact or intuitive feature representations. Current methods typically combine features from several domains (e.g., loudness, timbre, harmony, rhythm), often as many as possible, and then apply feature selection and dimensionality reduction techniques. While these methods can improve classification performance, they leave much to be desired in terms of understanding the complex relationship between acoustic content and emotional associations. In this work, we employ regression-based deep belief networks to learn representations of music audio that are specifically optimized for the prediction of emotion.

In the audio examples below we investigate the reconstruction of features learned using this approach. The first example is simply the original audio from which features are extracted. Because the input to our DBN is magnitude spectra, it is not possible to perfectly reconstruct the audio from the DBN inputs; the second example demonstrates the best-case scenario for audio reconstruction from magnitude spectra. Finally, the third example demonstrates the audio reconstructed from features extracted by the first DBN layer.
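The layer-1 reconstruction step above can be sketched as a single up-and-down pass through an RBM-style layer: a magnitude-spectrum frame is propagated up to the hidden feature activations and then projected back down to an estimated magnitude frame (phase must still be recovered separately, e.g. with an iterative method such as Griffin-Lim). The sketch below is a minimal illustration with hypothetical, untrained weights and made-up layer sizes, not the trained model from the papers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical layer-1 parameters: weight matrix W, visible bias b_v,
# hidden bias b_h. In practice these would come from DBN pre-training.
rng = np.random.default_rng(0)
n_vis, n_hid = 257, 50  # e.g. magnitude bins of a 512-point FFT -> 50 features
W = rng.normal(scale=0.01, size=(n_vis, n_hid))
b_v = np.zeros(n_vis)
b_h = np.zeros(n_hid)

def encode(v):
    """Propagate a magnitude-spectrum frame up to layer-1 feature activations."""
    return sigmoid(v @ W + b_h)

def decode(h):
    """Project feature activations back down to a magnitude-spectrum estimate."""
    return sigmoid(h @ W.T + b_v)

frame = rng.random(n_vis)   # stand-in for one normalized magnitude frame
recon = decode(encode(frame))
print(recon.shape)          # same shape as the input frame: (257,)
```

Repeating this per frame yields an estimated magnitude spectrogram, from which audio can then be resynthesized once phase is estimated.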

Audio Examples

Original Audio
Magnitude Reconstruction (DBN Input)
Layer 1 Feature Reconstruction


Relevant Publications:


  • Schmidt, E. M., Scott, J., and Kim, Y. E. (2012). Feature Learning in Dynamic Environments: Modeling the Acoustic Structure of Musical Emotion. Proceedings of the 2012 International Society for Music Information Retrieval Conference, Porto, Portugal: ISMIR. [PDF]

  • Schmidt, E. M. and Kim, Y. E. (2011). Modeling the acoustic structure of musical emotion with deep belief networks. NIPS Workshop on Music and Machine Learning, Sierra Nevada, Spain: NIPS-MML. [Oral Presentation]

  • Schmidt, E. M. and Kim, Y. E. (2011). Learning emotion-based acoustic features with deep belief networks. Proceedings of the 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY: WASPAA. [PDF]