Predicting Time-Varying Emotion Space Heatmaps

Human emotional responses to music are dynamic processes that evolve over time in synchrony with the music. Because of this dynamic nature, systems that predict emotion in music must analyze the audio over short-time intervals, modeling not just the relationships between acoustic features and emotion parameters, but also how those relationships evolve over time. In this work we model these relationships with a conditional random field (CRF), a powerful graphical model trained to predict the conditional probability p(y|x) of a label sequence y given a feature sequence x. Because the CRF treats the features as given rather than modeling their distribution, it retains the rich local subtleties present in the data, which is especially well suited to content-based audio analysis, where such data is abundant. We train our graphical model on the emotional responses of individual annotators in an 11x11 quantized representation of the arousal-valence (A-V) space. Our model is fully connected over the A-V bins and produces an estimate of the conditional probability for each bin, allowing us to easily represent complex emotion-space distributions (e.g., multimodal ones) as an A-V heatmap.
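As a concrete illustration, the sketch below trains this kind of per-frame CRF with the sklearn-crfsuite library and reads out per-bin marginals as a heatmap. The feature representation, the bin-label encoding (strings like "a3_v7"), and the hyperparameter values are illustrative assumptions, not the exact configuration used in the papers listed under Relevant Work.

```python
# Minimal sketch: a linear-chain CRF over quantized A-V bin labels.
# Assumes per-frame acoustic features (e.g. MFCC statistics) and
# per-second A-V bin labels; sklearn-crfsuite expects each frame's
# features as a dict of name -> value and each label as a string.
import numpy as np
import sklearn_crfsuite

def frame_features(feats):
    """Convert a (T, D) feature matrix into a list of per-frame dicts."""
    return [{f"f{d}": float(v) for d, v in enumerate(frame)} for frame in feats]

# X: one sequence per clip, each a list of per-frame feature dicts.
# y: matching sequences of A-V bin labels such as "a3_v7".
# (Random placeholders stand in for real features and annotations.)
rng = np.random.default_rng(0)
X = [frame_features(rng.normal(size=(15, 20))) for _ in range(10)]
y = [[f"a{rng.integers(11)}_v{rng.integers(11)}" for _ in range(15)]
     for _ in range(10)]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100)
crf.fit(X, y)

# predict_marginals returns, for every frame, a dict mapping each bin
# label to its conditional probability -- the per-bin estimates that
# can be rendered as an 11x11 A-V heatmap.
marginals = crf.predict_marginals(X[:1])[0]
heatmap = np.zeros((11, 11))
for label, p in marginals[0].items():
    a, v = label[1:].split("_v")
    heatmap[int(a), int(v)] = p
```

Because the marginals form a full distribution over the 121 bins rather than a single point estimate, multimodal responses (e.g., a clip that some listeners hear as tense and others as energetic) show up directly as multiple peaks in the heatmap.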


Examples using the MoodSwings Turk dataset:

MoodSwings Turk is an open-source dataset for the development of music emotion recognition systems.
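To connect the dataset's continuous annotations to the 11x11 grid used above, the sketch below quantizes per-second A-V ratings into bin labels. The assumption that ratings are normalized to [-0.5, 0.5] is for illustration only and should be checked against the dataset's actual range.

```python
import numpy as np

def quantize_av(arousal, valence, n_bins=11, lo=-0.5, hi=0.5):
    """Map continuous A-V ratings (assumed in [lo, hi]) to bin labels."""
    edges = np.linspace(lo, hi, n_bins + 1)
    a = np.clip(np.digitize(arousal, edges) - 1, 0, n_bins - 1)
    v = np.clip(np.digitize(valence, edges) - 1, 0, n_bins - 1)
    return [f"a{ai}_v{vi}" for ai, vi in zip(a, v)]

# One annotator's per-second ratings for a 15-second clip
# (random placeholders in place of real annotations).
labels = quantize_av(np.random.uniform(-0.5, 0.5, 15),
                     np.random.uniform(-0.5, 0.5, 15))
```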


Examples using the instrumental dataset:

Relevant Work:


• Schmidt, E. M. and Kim, Y. E. (2011). Modeling the acoustic structure of musical emotion with deep belief networks. NIPS Workshop on Music and Machine Learning, Sierra Nevada, Spain: NIPS-MML. [Oral Presentation]

• Schmidt, E. M. and Kim, Y. E. (2011). Modeling musical emotion dynamics with conditional random fields. Proceedings of the 2011 International Society for Music Information Retrieval Conference, Miami, Florida: ISMIR.