MoodSwings Turk Dataset


    In previous work we designed MoodSwings, a collaborative online game that leverages crowdsourcing to collect mood ratings. The game board is based on the Arousal-Valence (A-V) space, where the valence dimension represents positive versus negative emotion and the arousal dimension represents high versus low energy. Anonymously-partnered players label song clips together during each round, scoring points based on the overlap between their cursors, which encourages consensus. Bonus points are awarded to a player whose partner moves toward him or her, encouraging competition and discouraging players from blindly following their partners to score points. We recently initiated a redesign effort, investigating gameplay improvements suggested by an analysis of collected labels. However, we have not addressed concerns that the game structure may bias annotations.

To address these questions we designed a simplified labeling task for Mechanical Turk. Single workers provide A-V labels for clips from our dataset, which consists of 240 15-second clips; the clips are extended to 30 seconds in the annotation task to give workers additional practice. As in MoodSwings, we collect per-second labels, but no partner is present and no points are awarded. Workers are given detailed instructions describing the A-V space. They navigate to a website which hosts the task and label 11 randomly chosen clips. The first clip is a practice round and is omitted from our analysis. The third and ninth are identical, randomly chosen from a set of 10 “verification clips,” which are evaluated to identify unsatisfactory work. Workers are given a 6-digit verification code to enter on the MTurk website as proof of completion; successful completion earns workers $0.25 per HIT.
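The repeated verification clip (rounds 3 and 9) makes it possible to screen for unsatisfactory work by comparing a worker's two passes over the same clip. A minimal sketch of one plausible screening criterion, mean Euclidean distance between the two per-second A-V traces (the exact criterion and threshold used are not specified here and are assumptions):

```python
import numpy as np

def verification_agreement(labels_a, labels_b):
    """Mean Euclidean distance between two per-second A-V label
    sequences for the same clip (e.g. rounds 3 and 9).  Each input is
    a sequence of (valence, arousal) pairs."""
    a = np.asarray(labels_a, dtype=float)
    b = np.asarray(labels_b, dtype=float)
    n = min(len(a), len(b))          # guard against length mismatch
    return float(np.mean(np.linalg.norm(a[:n] - b[:n], axis=1)))

# Hypothetical screening rule: reject a HIT if the worker's two passes
# over the verification clip disagree by more than some threshold.
round3 = [(0.2, 0.5), (0.3, 0.5), (0.3, 0.6)]
round9 = [(0.25, 0.5), (0.3, 0.45), (0.35, 0.6)]
print(verification_agreement(round3, round9))  # small value -> consistent worker
```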

We are happy to share both the collected labels as well as a collection of acoustic features that were extracted for each song. If you decide to use this dataset in your work we kindly ask that you cite one of the following papers:

  • Schmidt, E. M. and Kim, Y. E. (2011). Modeling musical emotion dynamics with conditional random fields. Proceedings of the 2011 International Society for Music Information Retrieval Conference, Miami, Florida: ISMIR. [PDF]

  • Speck, J. A., Schmidt, E. M., Morton, B. G., and Kim, Y. E. (2011). A comparative study of collaborative vs. traditional annotation methods. Proceedings of the 2011 International Society for Music Information Retrieval Conference, Miami, Florida: ISMIR. [PDF]

  • Arousal-Valence Labels:

      The Arousal-Valence data is available for download as a MATLAB structure that, for each song, contains the following fields:

    artist: Name of the artist
    album: Name of the album
    song: Name of the song
    songid: The Song ID. These values can be used to obtain the corresponding features.
    userid: A numerical ID assigned to a Turker, unique to one labeling set of 11 songs (rounds)
    round: The round number (1-11) in which the song was annotated. Note: round 1 is practice, and rounds 3 and 9 are verification, so those rounds will not appear.
    time: The time (in seconds) in the song to which the valence and arousal labels correspond
    valence: The valence values for that song
    arousal: The arousal values for that song

    Click Here To Download the Dataset
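The structure can be read in Python as well as MATLAB. A minimal sketch using `scipy.io.loadmat`; the demo below writes a tiny stand-in file with the documented fields so it runs without the real download, and the actual release may nest the struct slightly differently:

```python
import numpy as np
from scipy.io import loadmat, savemat

# Build a small stand-in file with the documented fields (artist, album,
# song, songid, userid, round, time, valence, arousal).
entry = {
    "artist": "Some Artist", "album": "Some Album", "song": "Some Song",
    "songid": 1001, "userid": 42, "round": 2,
    "time": np.arange(15),                        # per-second timestamps
    "valence": np.random.uniform(-0.5, 0.5, 15),
    "arousal": np.random.uniform(-0.5, 0.5, 15),
}
savemat("labels_demo.mat", {"labels": entry})

# squeeze_me/struct_as_record give attribute-style access to the fields.
data = loadmat("labels_demo.mat", squeeze_me=True, struct_as_record=False)
rec = data["labels"]
print(rec.song, int(rec.songid), rec.valence.shape)
```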

    Acoustic Features:

      We are also providing features for each of the 240 songs used in our MoodSwings Turk dataset. The zip file attached below contains one .mat file per song, named with the song's ID number. Each .mat file contains a data structure with a field for each of the features described below. All audio was decimated to a sampling rate of 22050 Hz before feature computation. Features are provided for the entire song so that they may be trimmed and aggregated any way you like.
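Since features cover the entire song, aligning them with a 15-second clip means slicing the frame axis using the hop size. A sketch, assuming the 256-sample hop at fs=22050 used by the frame-level features below (the clip start time here is hypothetical):

```python
import numpy as np

def trim_frames(feature, clip_start_s, clip_dur_s, fs=22050, hop=256):
    """Slice a (n_frames, dim) frame-level feature matrix down to the
    frames covering [clip_start_s, clip_start_s + clip_dur_s)."""
    frames_per_s = fs / hop                      # ~86.13 frames per second
    i0 = int(round(clip_start_s * frames_per_s))
    i1 = int(round((clip_start_s + clip_dur_s) * frames_per_s))
    return feature[i0:i1]

# e.g. a full song's MFCC matrix and a clip starting at a hypothetical 30 s:
full = np.zeros((10000, 20))                     # ~116 s worth of frames
clip = trim_frames(full, clip_start_s=30.0, clip_dur_s=15.0)
print(clip.shape)  # (1292, 20)
```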

  • Mel-frequency cepstral coefficients (MFCCs) using Dan Ellis' 'melfcc.m'
    Cepstral Coefficients: 20
    Window length: 0.0232s (512 samples, fs=22050)
    Hop length: 0.0116s (256 samples, fs=22050)
    Min Frequency: 133.33 Hz
    Max Frequency: 6855.6 Hz
    Number of Mel Filter Bands: 40
    Dither: True
    Pre-emphasis Filter: True
    Exponent for liftering: 0
    Magnitude FFT sums to 1
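For reference, the listed settings can be approximated in numpy. This is a minimal sketch, not a reimplementation of 'melfcc.m': dithering is omitted, the pre-emphasis coefficient (0.97) and filter shapes are common defaults assumed here, and exact values will differ from the released features.

```python
import numpy as np

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(x, fs=22050, n_fft=512, hop=256, n_mels=40,
         n_ceps=20, fmin=133.33, fmax=6855.6, preemph=0.97):
    x = np.append(x[0], x[1:] - preemph * x[:-1])        # pre-emphasis
    n_frames = 1 + (len(x) - n_fft) // hop
    win = np.hamming(n_fft)
    frames = np.stack([x[i*hop:i*hop+n_fft] * win for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    mag /= np.maximum(mag.sum(axis=1, keepdims=True), 1e-10)  # magnitude sums to 1

    # Triangular mel filterbank between fmin and fmax (40 bands)
    pts = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    logmel = np.log(mag @ fb.T + 1e-10)
    # DCT-II over the mel bands -> 20 cepstral coefficients (no liftering)
    k = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * np.outer(np.arange(n_ceps), k + 0.5))
    return logmel @ dct.T                                 # (n_frames, n_ceps)

t = np.arange(22050) / 22050.0                            # one second of audio
M = mfcc(np.sin(2 * np.pi * 440 * t))
print(M.shape)  # (85, 20)
```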
  • Octave-Based Spectral Contrast:
    Uses 7 octave-based bands. The feature is 14-dimensional: the first seven dimensions are the spectral valleys and the second seven are the spectral peaks; contrast is the difference between the two. The spectral valley for a band is computed by sorting that band's values in ascending order and summing the first 2% of the bandwidth; the peaks are computed the same way, but sorting in descending order.
    Window Length: 512
    Hop Length: 256
    Window Type: Hamming
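The valley/peak computation above can be sketched in numpy. The octave band edges (doubling upward from 100 Hz) are an assumption, since the description does not specify them, and the released features may use different edges:

```python
import numpy as np

def spectral_contrast(mag, fs=22050, n_bands=7, frac=0.02, f_low=100.0):
    """14-dim octave-based spectral contrast per frame: seven valleys
    followed by seven peaks.  mag: (n_frames, n_bins) magnitude spectra."""
    n_bins = mag.shape[1]
    freqs = np.linspace(0, fs / 2, n_bins)
    edges = f_low * 2.0 ** np.arange(n_bands + 1)   # assumed: 100..12800 Hz
    edges[-1] = min(edges[-1], fs / 2)
    valleys, peaks = [], []
    for b in range(n_bands):
        idx = (freqs >= edges[b]) & (freqs < edges[b + 1])
        band = np.sort(mag[:, idx], axis=1)         # ascending per frame
        k = max(1, int(frac * band.shape[1]))       # first 2% of the band
        valleys.append(np.log(band[:, :k].sum(axis=1) + 1e-10))
        peaks.append(np.log(band[:, -k:].sum(axis=1) + 1e-10))
    return np.column_stack(valleys + peaks)         # (n_frames, 14)

frames = np.random.RandomState(7).randn(85, 512) * np.hamming(512)
sc = spectral_contrast(np.abs(np.fft.rfft(frames, axis=1)))
print(sc.shape)  # (85, 14)
```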
  • Statistical Spectrum Descriptors (SSDs):
    Four-dimensional feature comprising the spectral centroid, spectral flux, spectral rolloff, and spectral flatness, in that order.
    Window Length: 512
    Hop Length: 256
    Window Type: Hamming
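The four descriptors can be sketched per frame in numpy. The flux definition (L2 norm of the frame-to-frame difference) and the rolloff threshold (85% of spectral energy) are common conventions assumed here, not values stated above:

```python
import numpy as np

def ssd(mag, fs=22050):
    """Per-frame 4-D descriptor: [centroid, flux, rolloff, flatness].
    mag: (n_frames, n_bins) magnitude spectra."""
    n_frames, n_bins = mag.shape
    freqs = np.linspace(0, fs / 2, n_bins)
    total = mag.sum(axis=1) + 1e-10
    centroid = (mag * freqs).sum(axis=1) / total
    # Flux: L2 distance between consecutive spectra (0 for the first frame)
    flux = np.r_[0.0, np.sqrt((np.diff(mag, axis=0) ** 2).sum(axis=1))]
    # Rolloff: frequency below which 85% of spectral magnitude lies
    cum = np.cumsum(mag, axis=1)
    rolloff = freqs[np.argmax(cum >= 0.85 * cum[:, -1:], axis=1)]
    # Flatness: geometric mean / arithmetic mean of the magnitudes
    flatness = np.exp(np.mean(np.log(mag + 1e-10), axis=1)) / (mag.mean(axis=1) + 1e-10)
    return np.column_stack([centroid, flux, rolloff, flatness])

mag = np.abs(np.fft.rfft(np.random.RandomState(0).randn(10, 512), axis=1))
d = ssd(mag)
print(d.shape)  # (10, 4)
```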
  • Chromagram:
    Extracted using Dan Ellis' 'chromagram_IF.m' with default values.
    Window length: 2048
    Num Bins: 12
    Center Frequency: 1000
    Gaussian SD: 1
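For orientation, a chromagram folds spectral energy onto the 12 pitch classes. The sketch below is a plain STFT-bin folding, NOT Ellis' instantaneous-frequency method used for the released features; the 1000 Hz center frequency here only anchors the pitch-class mapping:

```python
import numpy as np

def chroma_stft(mag, fs=22050, n_chroma=12, tuning_hz=1000.0):
    """Fold STFT magnitude bins onto 12 pitch classes.
    mag: (n_frames, n_bins) magnitude spectra."""
    n_bins = mag.shape[1]
    freqs = np.linspace(0, fs / 2, n_bins)
    valid = freqs > 0                            # skip the DC bin
    # Pitch class of each bin, relative to the tuning frequency
    pc = np.mod(np.round(12 * np.log2(freqs[valid] / tuning_hz)), 12).astype(int)
    chroma = np.zeros((mag.shape[0], n_chroma))
    for c in range(n_chroma):
        chroma[:, c] = mag[:, valid][:, pc == c].sum(axis=1)
    return chroma

mag = np.abs(np.fft.rfft(np.random.RandomState(1).randn(10, 2048), axis=1))
C = chroma_stft(mag)
print(C.shape)  # (10, 12)
```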
  • EchoNest Audio Features:
    Extracted using the EchoNest Python API. These features are beat-synchronous and therefore use varying hop times. A vector of window start times is included to help in aggregating these features. Timbre, pitches, and loudness features are included. For more info on the EchoNest and their features check out:
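Because the hop times vary, the included start-time vector is what lets you put these features on a uniform grid, e.g. to match the per-second A-V labels. A sketch of one simple aggregation choice among many (the beat grid below is hypothetical):

```python
import numpy as np

def beat_sync_to_seconds(starts, feats, duration_s):
    """Resample beat-synchronous features onto a uniform 1-second grid.
    starts: segment start times in seconds; feats: (n_segments, dim).
    Each output second takes the feature of the segment active at its
    midpoint."""
    starts = np.asarray(starts, dtype=float)
    feats = np.asarray(feats, dtype=float)
    mids = np.arange(duration_s) + 0.5
    # Index of the last segment starting at or before each midpoint
    idx = np.clip(np.searchsorted(starts, mids, side="right") - 1,
                  0, len(starts) - 1)
    return feats[idx]

starts = [0.0, 0.45, 0.95, 1.43, 1.92, 2.41]       # hypothetical beat grid
feats = np.arange(6)[:, None] * np.ones((1, 12))   # 6 segments, 12-D timbre
per_sec = beat_sync_to_seconds(starts, feats, duration_s=3)
print(per_sec[:, 0])  # segment active at 0.5 s, 1.5 s, 2.5 s
```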

    Click Here To Download The Song Level Features

  • Contact: