Abstract
Continuous self-reported ratings of the emotion expressed by four pieces of music were collected on a two-dimensional (valence and arousal) emotion space in a repeated measures (test-retest conditions) design. Initial orientation time (IOT), test-retest reliability and afterglow were examined. Median IOT was 8 seconds. Valence ratings took up to 25 seconds (median 4) to orient, and arousal ratings up to 35 seconds (median 12). Slower tempi seemed to require longer IOT. Test-retest reliability was examined using correlation coefficients and by comparing periods of good sample-by-sample agreement in response between the Test and Retest conditions. About 80% of responses were reliable in both the Test and Retest conditions regardless of response dimension. Pearson correlations demonstrated better test-retest reliability for arousal responses than for valence. Retest condition ratings were within 8% of Test condition ratings within participant. The average standard deviation of ratings, collapsed across dimension, stimulus and condition, was 12.2% of the rating scale range. Afterglow effects (large outliers in the spread of scores just after the end of a piece) were identified. The reliability of continuous emotional response is therefore considered to be quite good, but caution must be taken in dealing with the opening and ending of continuous emotional response data.
Keywords
Introduction
Both emotional responses to music, and the music which triggers these emotions, unfold in time. With relatively cheap and fast computers, it is possible to collect emotional response to music and other temporally dependent stimuli in real time, using little more than a computer and a mouse (Cowie et al., 2000; Nagel, Kopiez, Grewe, & Altenmüller, 2007; Schubert, 1999, 2007b) or more specialized devices (Geringer, Madsen, & Gregory, 2004; Stevens et al., 2009). As a result we have been able to investigate aspects of emotion in music that were not afforded to researchers previously because they were restricted to measuring emotional response to music after the musical stimulus was heard, the so-called ‘postperformance’ or retrospective measure (e.g., Asmus, 1985; Gabrielsson & Lindström, 2001; Hevner, 1936). The continuous approach has allowed us to verify that emotions can vary from moment to moment as a piece of music unfolds, and that these variations can only be captured with postperformance methods in a way that is necessarily reductive: the richness of a time-varying emotional experience is potentially lost.
While the number of papers published on continuous emotional response to music is rising, there are relatively few that focus specifically on analytic issues. For example, when a continuous response is reported, how certain can we be that a statistically identical profile would emerge if the experiment were run again at a different time? That is, replication is rare, and the present study aims to address this. Exceptions are Schubert (1999) and Grewe et al. (2007a), but the former only reports general, overall test-retest reliability using Pearson correlations, and the latter took repeated measures from one participant, albeit seven times (see also Grewe, Nagel, Kopiez, & Altenmüller, 2007b). Further, there are several complications when dealing with time-series data. One of them is the reliability of the initial responses made to a piece of music – a period of ‘settling-in’ during which responses are unreliable. Another complication occurs at the end of a continuous rating collection period, when the participant’s emotion-expressed rating continues despite the absence of the just completed music. Schubert and Dunsmuir (1999) catalogued various kinds of outliers in the continuous music-emotion system. The initial response outlier was classified by them as ‘orientation’ time. The variability in response just after the music had ended was called ‘afterglow’. However, no statistical analysis was provided to diagnose what an initial orientation time might be, and afterglow was described in terms of outliers from a musical feature predictive model. The present study aims to revisit these issues and extend recent research by Bachorik et al. (2009) from the perspective of the variability of multiple participant ratings to the same piece, rather than from a musical feature modelling point of view (for which, see also Korhonen, Clausi, & Jernigan, 2006; Schubert, 2004). 
In addition, this paper will report how reliable emotion ratings are using a test-retest experimental paradigm.
Methods used to analyse time-series data range from pure visual inspection to sophisticated time-series and functional data analysis methods (see Box, Jenkins, & Reinsel, 1994; Engle, 2001; Gottman, 1981; Ramsay & Silverman, 2002). According to the review of the field by Schubert (2001), most analytic techniques applied to emotion in music data lie on the less sophisticated, visual inspection end of the spectrum, and do not always show awareness of the complications to which time-series data are prone, such as serial correlation (Schubert, 2002). Further, the apparent complexities of the sophisticated analytic techniques discourage some researchers from applying legitimate ways of processing data. In this paper, a simple method of identifying rating agreement in a time series, the Second Order Deviation Threshold, will be applied. It will be described in the analysis section of this paper.
A study by Schubert (1999) (the data of which are the basis of the present study) presented results of two Pearson correlation analyses comparing the Test and Retest conditions. Correlation coefficients of 0.75 for valence (the positive-negative dimension of emotion) and 0.71 for arousal (the arousal-sleepiness dimension of emotion) were reported. Since then, research into self-report continuous emotional response has developed sufficiently that a more sophisticated analysis is timely. For example, there have been several developments in the comparison of such time-series data (Grewe, Kopiez, & Altenmüller, 2009a, 2009b; Grewe et al., 2007a, 2007b; McAdams, Vines, Vieillard, Smith, & Reynolds, 2004; Schubert, 2002; Vines, Krumhansl, Wanderley, & Levitin, 2006). Grewe et al. (2007a) examined responses of a participant who provided continuous responses to the same pieces of music seven times. In that study moment by moment median responses for the pieces were reported, as well as the differenced (explained in more detail later in this paper) responses, and the nonparametric Wilcoxon test was used to identify whether peak moments were significant. McAdams et al. (2004) compared two sets of responses using running (moment by moment) F-tests. The risk of inflated Type I error due to multiple testing (because time-series data typically have many samples) was dealt with by setting the error probability threshold to p = 0.01 or 0.001, rather than the more usual p = 0.05 found in conventional parametric data analysis. These approaches are generally appropriate for the analyses of interest in those studies. However, none of the more recent studies has specifically examined test-retest reliability with multiple participants. For example, the F-test and Wilcoxon test mentioned are typically used to identify significant differences between time points, rather than significant similarities across time-series. 
Schubert (1999) applied Pearson correlation coefficients as a measure of similarity of two time-series data sets, but this only demonstrated dependence, rather than absolute differences, and further, Pearson correlations were reported as producing inflated values unless the data sets were transformed, for example, through differencing (Schubert, 2002; Vines et al., 2006) which is what Schubert reported in the 1999 paper. In the absence of an established, absolute correlation coefficient about which test-retest reliability can be quantified, the approach in the present study is to compare two kinds of responses – valence and arousal – and determine which, if any, is more reliable.
Aim
The aim of the study was to explore the extent to which a continuous emotion rating of music is stable, and to what extent it remains stable when the same piece is rated again at a later date. The focus is on ratings of expressed emotion made near the beginning of the music, those made at the end of the music, as well as more general test-retest reliability. Further, the study examines whether arousal and valence ratings of emotion (as defined below) are reported with equal reliability.
Method
Participants
Fourteen participants completed the test and retest conditions as part of a larger study (Schubert, 1999). There were six males and eight females. Ages ranged from 19 to 48 years (mean = 29.40, SD = 10.36). Ten participants reported having some music-playing experience, including eight who reported more than ten years of playing. Four reported never having played a musical instrument.
Stimuli
Four pieces were selected with the intention of depicting a wide variety of emotions. They were chosen from a largely Romantic, orchestral idiom. These pieces were Slavonic Dance Op. 46, No. 1, by Antonin Dvorak (Slavonic Dance), Pizzicato Polka by Johann (Jr) and Josef Strauss (Pizzicato), Morning Mood from Peer Gynt by Edvard Grieg (Morning) and the Adagio Movement from Concierto de Aranjuez for guitar and orchestra by Joaquin Rodrigo (Adagio). Abbreviations used for referring to the pieces are shown in parentheses. Full recording details are provided in the Discography.
Procedure
The study used specially written software that controlled all aspects of participant interaction, interfacing and data collection. The study was conducted twice, first in the ‘Test’ condition, and later in a ‘Retest’ condition, as explained below. The software was run on a Macintosh LC520 computer with one participant tested at a time. In the test condition, participants received training to familiarize themselves with the Two-Dimensional Emotion-Space (2DES), which consisted of valence indicated on the x-axis (positive emotions to the right, negative to the left) and arousal on the y-axis (aroused emotions toward the top, and sleepy emotions towards the bottom), based on the circumplex model and as found on the affect grid of Russell (Russell, 1980; Russell, Weiss, & Mendelsohn, 1989). Such a layout implies that the structure of emotion is dimensional, meaning that emotions are thought of as occupying various locations along a limited number of dimensions (in this case valence – positive vs negative, and arousal – high vs low). However, there are other models, such as a discrete units structure (sad, happy, angry etc.) with no necessary link along underlying dimensions. The debate as to which model is suited to which circumstances is to be found elsewhere (Barrett, 1998; Christie & Friedman, 2004; Eerola & Vuoskoski, 2011; Laukka, 2008; Zentner, Grandjean, & Scherer, 2008). However, for this research the structure of emotion was based on a dimensional model based on valence and arousal. This model was considered appropriate because it suited the logic of moving a mouse around a two-dimensional surface (computer screen). The x (valence) and y (arousal) axes were presented on the computer screen and mouse movements within the bounds of the axes were recorded on a 201 point scale (−100 to +100), with axes crossing at the origin.
The training consisted of three stages, each involving the presentation of a selection of words which encompassed a wide range of valence and arousal values. In one stage, only the arousal axis was displayed, in another only the valence axis was displayed and in the final stage, both arousal and valence axes were displayed as a 2DES. The order of initial presentation of valence and arousal stages was selected randomly across participants. The high arousal words used were Afraid, Angry and Happy, while the low arousal words were Relaxed and Sad. An animation placed each of these words at an appropriate region of the emotion space. This was followed by a practice session, where participants placed these and other words and pictures of faces (based on Ekman & Friesen, 1975) on the emotion space by clicking with the computer’s mouse. After a little practice, participants were able to do this reliably. The results of these responses are reported in Schubert (1999).
Then the participants were presented with the first of the four music stimuli, selected at random for each participant. They moved the mouse to the middle of the emotion space and then the music, played over headphones, commenced. The software controlled the audio, which was all contained on one audio CD (see Discography), via the built-in CD-ROM player of the computer. Participants listened to the audio via headphones in a quiet room at a comfortable listening level. The participant was instructed to move the mouse around the emotion space to reflect the emotion that the music was expressing (and not how the music made the participant feel; see Evans & Schubert, 2008; Gabrielsson, 2002; Grewe et al., 2007a; Salimpoor, Benovoy, Longo, Cooperstock, & Zatorre, 2009). When no emotion was detected, participants were instructed to move the pointer to the middle of the emotion space. The four pieces were presented in random order, and participants were encouraged to take a short break between each piece.
Participants were not aware that they would be asked to take part in the study again, and six to twelve months elapsed before they returned to complete the Retest condition. This strategy was intended to minimize the possibility that participants would deliberately try to memorize their responses in the test condition. The exact timing of the second data collection phase could not easily be controlled because not all participants were available at the same time. However, it has been argued elsewhere that variations in the interval before repeating this kind of task, within the range of durations indicated, are unlikely to affect results considerably (Cepeda, Pashler, Vul, Wixted, & Rohrer, 2006; Lucas, Schubert, & Halpern, 2010). In the Retest condition, the participants received minimal training which was intended to provide a reminder of the way the emotion space worked. The pieces were presented in random order. In all continuous response recording, participant responses were recorded once per second (Schubert, 2001).
Analysis
Schubert (2007c, 2010; Schubert et al., in press) proposed a general analytic technique that made minimal assumptions about the distribution characteristics of the continuous self-report emotional data. The technique was developed to provide a simple approach to analysing time-series data such as that of the present study. While the method is flexible and can be optimized for different situations, it can be performed on a conventional spreadsheet program using nothing more sophisticated than mean and standard deviation calculation, though other approaches are mentioned in some of the analyses that follow, and in the Discussion section. Further, the method has a visual orientation for easy identification of regions consisting of reliable ratings of the time series. Consider the general time-series X(t, ζi)p,d,c where X is a matrix representing Arousal or Valence (subscript d) ratings at time t (each increment of this index progresses time by one sample) by participant i (where i varies from 1 to 14 for a given sample; taken together, these ratings are referred to as an ensemble of ratings) to piece (or stimulus) p in condition c (Test vs Retest). The sample-by-sample ensemble mean is represented as a new time series, M(t)p,d,c, the sample-by-sample ensemble median is represented by a new time series Mdn(t)p,d,c and the sample-by-sample ensemble rating standard deviation time series as SD1(t)p,d,c. This last series will be referred to as the first order standard deviation, and abbreviated SD1(t). Lower case subscripts indicate that the series is calculated for each combination (‘p,d,c’ for example, indicates 4 pieces by 2 dimensions by 2 conditions), whereas a capital letter subscript denotes the level of the variable (for example, M(t)p,d,T indicates the sample-by-sample ensemble mean score in the test condition for each combination of piece by dimension). 
Individual elements of a time series are indicated by an index, such as X(t5, ζ7), which is the fifth time sample rating for participant number seven in the time series rating matrix X. A single real number evaluation of a time series is represented in lower case. For example, the standard deviation of the mean series, M(t), will evaluate to sd1.
The mean of SD1(t), written here as $\overline{sd1}$, is a real number:

$$\overline{sd1} = \frac{1}{T}\sum_{t=1}^{T} \mathrm{SD1}(t) \tag{1}$$

The standard deviation of the SD1(t) series, which, like its mean, is a real number, will be referred to as the second order standard deviation, abbreviated sd2:

$$sd2 = \sqrt{\frac{1}{T-1}\sum_{t=1}^{T}\left(\mathrm{SD1}(t) - \overline{sd1}\right)^{2}} \tag{2}$$

where T is the total number of samples recorded for a given piece in one condition and dimension. This total was dependent on the number of seconds of the CD track, and included about three seconds of silence after the stimulus finished playing.
These values were then used to assign a threshold range, or threshold level. The range or level of the threshold is labelled τ and is defined as

$$\tau_{\pm n} = \overline{sd1} \pm n \cdot sd2 \tag{3}$$

$$\tau_{n} = \overline{sd1} + n \cdot sd2 \tag{4}$$

where n is a real number multiplier of sd2.¹ τ±n refers to a range within which elements of the SD1(t) series are to fall to be considered in ‘good agreement’, and otherwise they are considered in ‘poor agreement’ (Equation 3). For the threshold level τn (Equation 4), elements of SD1(t) are coded as ‘poor agreement’ when greater than or equal to τn (because the deviation score for element t in SD1(t) is relatively large) and otherwise as ‘good agreement’ (because the deviation score of the element t in SD1(t) is relatively small). These terms will be applied and further developed in the reporting of the results.
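The threshold coding described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the software used in the study: the function names are hypothetical, the sample (T − 1) form of the standard deviation is assumed for both SD1(t) and sd2, and the range τ±n is treated as inclusive at its bounds.

```python
from statistics import mean, stdev

def sd1_series(ratings):
    """ratings[t][i] is the rating of participant i at sample t.
    Returns SD1(t): the ensemble standard deviation at each sample
    (sample form, divisor N - 1, which is an assumption)."""
    return [stdev(row) for row in ratings]

def threshold_range(sd1, n=1.0):
    """Return (lower, upper) bounds of the range tau_{+-n}.
    The upper bound is also the threshold level tau_n."""
    centre = mean(sd1)   # mean of the SD1(t) series
    sd2 = stdev(sd1)     # second order standard deviation
    return centre - n * sd2, centre + n * sd2

def good_agreement_level(sd1, n=1.0):
    """Code each sample against the level tau_n:
    'poor' when SD1(t) >= tau_n, otherwise 'good' (True)."""
    _, upper = threshold_range(sd1, n)
    return [s < upper for s in sd1]

def good_agreement_range(sd1, n=1.0):
    """Code each sample against the range tau_{+-n}
    (inclusive bounds are an assumption)."""
    lower, upper = threshold_range(sd1, n)
    return [lower <= s <= upper for s in sd1]

# Toy ensemble: four samples rated by three hypothetical participants.
toy = [[0, 0, 0], [10, -10, 0], [1, 0, -1], [2, 0, -2]]
toy_sd1 = sd1_series(toy)
toy_good = good_agreement_level(toy_sd1, n=1.0)
```

In the toy data only the second sample, where the three ratings diverge widely, is coded as poor agreement.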
Results
Initial orientation time
Initial orientation time (IOT) in the present study was defined as the time taken for an emotional rating SD1(t) series to settle to within τ±n as described above. This definition is analogous to ‘settling time’ used in off-line controller design in engineering (Tay, Mareels, & Moore, 1998, p. 93).² As with settling time, IOT was set as a range rather than a threshold because of the nature of the emotion space procedure used: Participants commenced their emotion ratings at the neutral centre of the emotion space. This means that the start of each piece produces good agreement. However, instructing participants to start in the neutral centre point of the emotion space is a pragmatic decision, and not a reflection of any true underlying agreement. In this case an estimate can be obtained by finding the time taken for the ratings to fall within the ensemble standard deviation range, such as ±n sd2 of the mean SD1(t) value: τ±n. Table 1 shows the results of the analysis conducted with three range criteria: τ±1, τ±1.5 and τ±2.
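A minimal Python sketch of one way to estimate IOT from an SD1(t) series follows. Several choices here are assumptions rather than details given in the text: the function name and the `hold` parameter are hypothetical, the first sample is taken to be at time zero, the bounds of τ±n are inclusive, and 'settling' is read as the first time SD1(t) stays inside the range for `hold` consecutive samples.

```python
from statistics import mean, stdev

def initial_orientation_time(sd1, n=1.0, hold=1, sample_period=1.0):
    """Estimate IOT as the time of the first sample at which SD1(t)
    has entered the range tau_{+-n} and stays there for `hold` samples.
    Returns None if the series never settles under this rule."""
    centre, sd2 = mean(sd1), stdev(sd1)
    lower, upper = centre - n * sd2, centre + n * sd2
    run = 0  # number of consecutive in-range samples seen so far
    for t, value in enumerate(sd1):
        run = run + 1 if lower <= value <= upper else 0
        if run == hold:
            return (t - hold + 1) * sample_period
    return None

# Toy SD1(t) series: near-zero spread at the start (all participants
# begin at the centre of the emotion space), then a rise to an
# 'operating' region of about 20 units.
toy_sd1 = [0, 2, 8, 20, 22, 21, 19, 20, 23, 20]
iot_narrow = initial_orientation_time(toy_sd1, n=0.5)
iot_medium = initial_orientation_time(toy_sd1, n=1.0)
iot_wide = initial_orientation_time(toy_sd1, n=2.0)
```

With this toy series the estimate shrinks as n grows, which mirrors the more liberal estimates expected from wider ranges.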
Table 1. Initial orientation time by stimulus and dimension.
As the interval about the mean of SD1(t) gets wider due to the larger τ±n criterion, we expect to see the initial response settle into the good agreement region sooner (i.e., the estimate will be more liberal). The most dramatic examples of this are in the Arousal response for Adagio (18 s to 5 s in Test condition and 35 s to 11 s in Retest condition for τ±1 and τ±2 respectively – see also Figure 1) and for Pizzicato in the Retest condition (for τ±1 it is 14 s, but for τ±2 it drops to 3 s). In some cases the IOT is extremely resilient to different τ criteria, such as the valence times for Pizzicato in both conditions (about 4 s), and for the Slavonic Dance: 2 to 5 seconds regardless of dimension and condition. Within this piece Arousal has the longest initial orientation time at 4 to 5 seconds, depending on the threshold criterion. These ‘resilient’ results may be a reflection of generally smaller SD1(t) movements from the origin throughout the piece, as would occur if SD1(t) remained very low throughout the stimulus-dimension-condition combination. Inspection of the eight SD1(t) time-series (Figure 2) indicates quasi-exponential rise, overshoot and decay toward the ‘operating’ region of SD1(t) from the beginning in each case, verifying the presence of an initial orientation (start up) time, analogous to settling time. The Adagio Arousal response is presented as a close-up example in Figure 1. It was chosen because it allows closer inspection of a result with a large disparity in IOT as a function of τ±n, with n = 1, 1.5 and 2. While the τ±2 range produces a smaller, more liberal estimate of the initial orientation time (5 s in Test condition and 11 s in the Retest condition), the τ±1 criterion allows us to see the markedly different effect of initial orientation time across the two conditions (18 s in the Test condition, and 35 s in the Retest condition).

Figure 1. Initial orientation time estimates for Adagio Arousal for Test and Retest conditions using two threshold criteria.

Figure 2. M(t) and SD1(t) time series of each stimulus, dimension and condition combination, showing regions of good agreement at threshold τ1.
Table 2 provides a summary of these results. It shows that across the four stimuli there is no major difference in median initial orientation time between Test and Retest. At τ±1, the arousal ratings oriented one second earlier (11.5 versus 12.5 s) in the Retest, and the valence IOT remained unchanged (4 s) across Test and Retest. It is noted, however, that the maximum IOT tended to be longer in the Retest condition regardless of the τ threshold range selected, perhaps reflecting the shorter training session that was performed for that condition. For arousal, the maximum value increased from 18 s to 35 s, and for valence, from 15 s to 25 s (Test to Retest) for τ±1. The trend was the same for the other two threshold criteria reported, but was most consistent for τ±1.5, in that the maximum values across conditions changed the least for arousal and for valence.
Table 2. Initial orientation time (in seconds) by response dimension collapsed across stimuli.
A = Arousal; V = Valence.
Test-retest reliability
The reliable responses in each time series (four pieces by two dimensions) were identified by comparing the τ1 threshold of SD1(t)p,d,T and SD1(t)p,d,R conditions. This meant that when elements in an SD1(t) series fall below τ1, the ratings are considered to be in good agreement with each other, and therefore, in part, reliable. This provided an indication of reliability within (i.e., during) the response to each piece and dimension. The τ1 threshold was chosen to allow a reasonably large number of response samples to be considered in good agreement for reasons that will become clear shortly.
Visual inspection of the mean responses, M(t)p,d,c, in Figure 2 comparing each Test-Retest time series pair shows how similar the profiles appear to be for each stimulus-dimension combination. The conventional Pearson test-retest reliability is indicated in Table 3, with all values at 0.83 or higher. However, the serial correlation commonly found in continuous response data can inflate the coefficients generated by Pearson correlations. Each of the participant responses was therefore first-order differenced to reduce the effect of serial correlation (as suggested by Schubert, 2002). The differencing transformation, which yields ΔM(t)p,d,c when applied to the ensemble mean time series, produces the changes in response from sample to sample for a given piece-dimension-condition combination, and is defined in general form as:

$$\Delta X(t) = X(t) - X(t-1), \qquad t = 2, \ldots, T \tag{5}$$
Table 3. Comparison of test-retest condition correlation and good agreement match methods for untreated (X(t)) and differenced (ΔX(t)) ratings.
Notes:
Stimulus: P = Pizzicato; M = Morning; S = Slavonic Dance; A = Adagio.
Dimension: A = Arousal; V = Valence.
CA/Any: corresponding sample good agreement (at τ1) match count between Test and Retest as a ratio of any (Test or Retest) samples of good agreement (i.e., samples in poor agreement in both Test and Retest are ignored).
CA/All: corresponding sample good agreement (at τ1) match count between Test and Retest as a ratio of all samples (T) of the piece.
If the coefficient of one of a valence-arousal pair is greater, it is shown in bold font.
The Test and Retest sample-by-sample ensemble averages ΔM(t)p,d,T and ΔM(t)p,d,R were subjected to a Pearson product–moment correlation analysis. This produced a test-retest reliability statistic that was markedly lower, as shown in Table 3, ranging from 0.40 for Adagio Valence to 0.90 for Slavonic Dance Arousal. Each of these correlation coefficients still produced large effect sizes, except for Adagio Valence, which produced a medium effect size (Cohen, 1992). The result demonstrates a clearer picture of relatively better reliability for arousal responses in most cases. Only Morning Valence produced a higher Pearson’s r (= 0.76) than its arousal counterpart, but this was relatively close to the Morning Arousal Test-Retest correlation coefficient (r = 0.74). To further examine test-retest correlation, a nonparametric approach was applied, where the median of the ensemble ratings was generated to produce the new Mdn(t)p,d,T and Mdn(t)p,d,R series. A Spearman’s rho analysis was performed on these pairs, and is reported in Table 3, as is the differenced version of the median series, ΔMdn(t). In general, all rho values are lower than their corresponding Pearson’s r. However, the comparison is similar, with arousal ratings producing relatively higher rho values. Only the Spearman rho for Adagio produced a tie between arousal and valence when the median ensemble time series was differenced, but in this case the value was low (0.21) and therefore both dimensions have dubious test-retest reliability according to the Spearman rho criterion. This analysis, too, suggests that arousal responses are more reliable. However, this is only an indication of reliability from the perspective of the statistical dependency of mean rating time series. It does not provide information about differences between rating levels, nor about regions in time which may be spurious noise rather than reliable responses. 
Therefore, two further analyses were performed, one examining difference in test-retest paired ratings, and the other based on good agreement across test-retest conditions.
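The effect of first-order differencing on a test-retest correlation, as used in the correlation analysis above, can be illustrated with a minimal Python sketch. The two series below are invented toy data, not ratings from the study, and the helper names are hypothetical. They share a rising trend, so the untreated Pearson coefficient is high, but their sample-to-sample changes move in opposite directions, which only becomes visible after differencing.

```python
from statistics import mean

def difference(series):
    """First-order differencing: X(t) - X(t-1) for t = 2..T."""
    return [b - a for a, b in zip(series, series[1:])]

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length series."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Invented ensemble-mean series for a Test and a Retest condition.
test_mean = [0, 1, 3, 4, 6, 7, 9, 10]
retest_mean = [0, 2, 3, 5, 6, 8, 9, 11]

r_untreated = pearson_r(test_mean, retest_mean)
r_differenced = pearson_r(difference(test_mean), difference(retest_mean))
```

Here the untreated coefficient is close to 1 while the differenced coefficient is strongly negative, an extreme case of the inflation that serial correlation can produce.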
Paired differences in ratings between corresponding Test-Retest samples for each stimulus-dimension combination were calculated by generating the time series that results from subtracting each Retest sample element of the series from the corresponding Test condition sample element, participant by participant, according to the formula for each sample t of participant i:

$$D(t, \zeta_i)_{p,d} = X(t, \zeta_i)_{p,d,T} - X(t, \zeta_i)_{p,d,R} \tag{6}$$
producing a new time series for each stimulus-dimension combination, for each participant. The resulting Arousal Test-Retest rating difference series for the four stimuli by fourteen participants were pooled, as were the Valence Test-Retest rating difference series for the four stimuli by fourteen participants.
The median of the Test and Retest difference time series was 0.5 for arousal and −1.0 for valence, with 25th and 75th percentile (and 10th and 90th percentile in parentheses) values of −2 (−6) and 4 (7.85) for arousal, and −5 (−10) and 2.5 (5.5) for valence respectively. This is on a rating scale ranging from −100 to +100, that is, 201 units. Positive values indicate that the rating made in the Test condition was higher than that made in the Retest condition. So, while the differences in paired ratings between Test and Retest were on the whole small, valence ratings were slightly higher in the Retest (by a median of 5 units, compared to 2 units for arousal). Considering the 10th to 90th percentile ranges, four-fifths of all pairwise Test-Retest difference time-series elements fell within 13.85 (= 7.85 + 6)/201 = 6.89% of the 201 point rating scale for arousal, and within 15.5 (= 5.5 + 10)/201 = 7.71% for valence.
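The percentile-band calculation can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, differences are taken as Test minus Retest (following the statement that positive values indicate a higher Test rating), and the 'inclusive' interpolation rule of `statistics.quantiles` is an assumption, since the percentile method used in the study is not stated.

```python
from statistics import quantiles

SCALE_POINTS = 201  # rating scale from -100 to +100

def paired_differences(test, retest):
    """Test minus Retest, sample by sample, for one participant."""
    return [a - b for a, b in zip(test, retest)]

def band_percent(p10, p90, scale_points=SCALE_POINTS):
    """Width of the 10th-90th percentile band as a % of the scale."""
    return 100.0 * (p90 - p10) / scale_points

def central_band_percent(diffs, scale_points=SCALE_POINTS):
    """Same measure computed directly from pooled difference scores."""
    deciles = quantiles(diffs, n=10, method="inclusive")
    return band_percent(deciles[0], deciles[-1], scale_points)

# Reproducing the reported arithmetic from the reported percentiles.
arousal_band = band_percent(-6, 7.85)   # about 6.89
valence_band = band_percent(-10, 5.5)   # about 7.71
```

Because four-fifths of the pooled differences lie between the 10th and 90th percentiles, the band width is a direct measure of how far Retest ratings typically sit from Test ratings.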
Next, samples with good agreement at τ1 were identified for X(t)p,d,T and X(t)p,d,R (for each stimulus-dimension combination) for the purpose of providing a measure of reliability that would take into account the sample-by-sample agreement of ratings. This was performed in two stages. In the first stage the points that fell below the τ1 threshold were located and counted for each of the sixteen stimulus-dimension-condition combinations. These are shown graphically as the shaded regions in Figure 2. As mentioned, this technique identifies the points that are in good agreement compared to other (unshaded) regions. Although standard deviation is not normally distributed, the central limit theorem predicts that with a sufficiently large number of samples the distribution will begin to resemble normality (Salkind, 2010), in which case the ratio of points identified would be 0.84 (84%) using a threshold of τ1. Across the sixteen stimulus-dimension-condition combinations, the proportion of high-agreement elements selected ranged from 0.79 (for Adagio Valence) to 0.94 (for Pizzicato Valence). The median of these ratios is 0.84, suggesting an effect of the central limit theorem with just 16 distribution samplings (albeit with a positive skew). However, what is important is where these regions of good agreement occur.
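The expected ratio of 0.84 is the proportion of a normal distribution lying below one standard deviation above its mean, which is the position of τ1 relative to the SD1(t) series. A short Python check, with a hypothetical helper name:

```python
from math import erf, sqrt

def normal_cdf(z):
    """Cumulative probability of a standard normal variable at z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Proportion of values expected below mean + 1 SD under normality.
expected_good_ratio = normal_cdf(1.0)  # approximately 0.8413
```

The same function gives the expected ratios for other multipliers n, under the same normality assumption.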
In the second stage the corresponding element numbers in the SD1(t)p,d,T and SD1(t)p,d,R series were compared and a new series generated where an element number was set to 1 if both the corresponding element numbers in test and retest condition had a value below τ1 (good agreement) and otherwise 0 (poor agreement). That is, the new corresponding sample agreement match time series is generated by producing a value of 1 when an element of SD1(t)p,d,T is less than τ1 and the corresponding element of SD1(t)p,d,R is also less than τ1. Otherwise the corresponding element of the new series is set to zero. In Figure 2 this is equivalent to locating the points in a time series where shaded regions of good agreement overlap (intersect) across test-retest plots. The analysis is summarized using two ratios. One ratio reports the count of 1 elements of the corresponding sample agreement match time series (equivalent to summing the elements), denoted by CA, as a proportion of the number of good agreement samples in either condition (number of samples with any shading in the overlapped Test and Retest plot). This means that a sample had to be considered in good agreement in at least one condition, and samples that are in poor agreement in both conditions are ignored (CA/Any columns in Table 3). Such a calculation is not biased by the total number of high-agreement samples; poor-agreement samples will not contribute to the ratio. But it also means that a higher ratio of between-condition good-agreement matches would be reported if there were a larger number of poor-agreement matching samples in both conditions. Therefore a second ratio reports CA as a proportion of the total number of samples (CA/All columns in Table 3, where ‘all’ here means T, the total number of time samples for the rating duration of the stimulus). For the moment, we will restrict our analysis to the matches identified from the undifferenced SD1(t) series.
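The two ratios can be sketched in Python as follows, taking as input the good-agreement coding (True for SD1(t) below τ1) of the Test and Retest series. The function name is hypothetical, and returning NaN when no sample is in good agreement in either condition is an assumption made for the sketch.

```python
def match_ratios(good_test, good_retest):
    """Return (CA/Any, CA/All) for two equal-length boolean series.
    CA counts samples in good agreement in BOTH conditions."""
    if len(good_test) != len(good_retest):
        raise ValueError("Test and Retest series must have the same length")
    pairs = list(zip(good_test, good_retest))
    ca = sum(1 for a, b in pairs if a and b)    # good in both conditions
    any_good = sum(1 for a, b in pairs if a or b)  # good in at least one
    total = len(pairs)                          # T, all samples
    ca_any = ca / any_good if any_good else float("nan")
    ca_all = ca / total if total else float("nan")
    return ca_any, ca_all

# Toy coding for five samples.
toy_test = [True, True, False, True, False]
toy_retest = [True, False, False, True, True]
toy_ca_any, toy_ca_all = match_ratios(toy_test, toy_retest)
```

In the toy data two samples match out of four that are shaded in either condition, and out of five samples in total, so CA/Any is 0.5 and CA/All is 0.4.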
As can be seen in Table 3, the fewest relative matches (regardless of ratio calculation) occur in Slavonic Dance Arousal (72% and 69% of samples with matches for CA/Any and CA/All, respectively). The highest ratio of matches occurred for Pizzicato Arousal, again regardless of the ratio (94% and 82% of any good match and of total samples respectively). It is now clear why a threshold (τ1) was chosen that identified a reasonably large number of samples in good agreement: when comparing across conditions, a more stringent τn criterion would have led to a loss of good-agreement points matching in both conditions. Hence, using an overly conservative threshold risked having no Test-Retest matches at all, thus preventing meaningful comparison.
The question at hand is to compare the reliability, according to this analysis, of arousal versus valence. The overall median for arousal match is 81% and for valence 84% as a proportion of any good agreement samples. As a proportion of all samples the overall median is 77% for both. Given the fluctuations of this proportion across stimuli, it is not possible to conclude that there is a difference in corresponding agreement between arousal and valence using this approach. In general, participants agree with each other’s responses in the same regions of a piece when responding on different occasions. According to the present calculations, based on a τ1 threshold, this occurs at least two thirds of the time (69% worst case), is typically about 80% of the time (average of the median results for all ratios) and up to 94% of the time among the stimuli investigated. However, corresponding agreement in Valence Test-Retest is proportionally higher than Arousal for Slavonic Dance and Adagio. When the same agreement matching analysis is conducted on the differenced data – that is ΔX(t)p,d,T with ΔX(t)p,d,R – Arousal consistently has a higher proportion of Test-Retest matches in agreement for all stimuli compared to Valence (Table 3). Since changes in arousal are sensitive to changes in loudness (Schubert, 2004) it seems reasonable that pieces with many intense changes in dynamics (such as Slavonic Dance and Adagio) will lead to a higher number of changing responses. So, this finding is likely to be related to the Romantic style, orchestral instrumentation of the selected music.
The final analysis concerning test-retest reliability involves an examination of the standard deviation time series across each stimulus-dimension-condition combination: SD1(t)p,d,c. Figure 3 shows a comparison of various statistics (box-plot, minimum, maximum, mean and τ±1) applied to the SD1(t) time series. The height of each median and mean can be thought of as a ‘disagreement’ or ‘confusion’ estimate, because a high mean or median indicates larger SD1(t) elements in the course of the emotional response dimension for that piece. Mean SD1(t) values ranged from 21 (Slavonic Dance Valence – both test and retest, median also 21 for both), to 28 (Adagio – both test and retest, median also 28 for both) units on a 201 point scale, which translates to 10% and 14% of the response scale range respectively. Median values produce similar results (the median lines of the box plots are close to or overlap the large crosses representing the mean, with the mean displaying a small positive skew in all cases). Slavonic Dance Arousal Retest SD1(t)S,A,R had the smallest interquartile range (6.42 units), while Adagio Arousal Test had the largest (11.5 units). The τ±1 and interquartile range changed little from Test to Retest within stimulus and dimension; however, a small trend could be observed with Retest variability being slightly smaller than Test variability (e.g., compare interquartile range of Slavonic Dance Valence, noting that it is smaller in the Retest than the Test condition). No clear trend could be observed across dimension, but overall stimulus-wise differences could be observed, with Slavonic Dance producing higher agreement, with least heteroskedasticity (see the relatively low mean SD1(t), and small interquartile and τ±1 range in Figure 3) of the four stimuli, and Adagio producing poorer agreement than the other stimuli (both according to τ±1 and interquartile distances).
Adagio Valence had the highest deviation scores, with a peak (maximum) of 49 units for both Test and Retest conditions (24% of the 201 point scale). Across the entire responses, collapsed across stimulus-dimension-condition, the mean standard deviation was 26.6 units, which is 12.2% of the response scale range (median = 12.6%).

Figure 3. Central tendency (mean and median) and spread (sd and box plot) of SD1(t)p,d,c.
Afterglow
Afterglow was calculated by examining the SD1(tT)p,d,c values, that is, the ensemble standard deviations of the last sample for each stimulus-dimension-condition combination. The last sample was approximately three seconds after the offset of the last note of the piece (see Discography for tT values of each stimulus). It was not possible to obtain a precise offset time as different recordings had different amounts of reverberation. To put the final sample deviation score into perspective, it was divided by sd2. Table 4 lists SD1(tT)p,d,c for each stimulus-dimension-condition combination, both in rating scale units and as a multiple of sd2. For comparison, the initial sample standard deviation, SD1(t1)p,d,c, is shown, which was expected to be small because all participants commenced their response from the same, central region on the emotion space. The peak SD1(t) value of each series is also shown. The overall highest peak SD1(t)p,d,c value was 9.64 sd2 units, which occurred in the last sample of Slavonic Dance Arousal. This is typical of the final SD1(tT)p,d,c values, which were close to or matched the peak SD1(t) value. The average of the peak values across stimuli and dimensions was 6.72 sd2 units, whereas the final, afterglow SD1(tT) values averaged 6.00 sd2 units. For comparison, the initial SD1(t1)p,d,c averaged only 0.3 sd2 units across stimulus-dimension-condition combinations. No obvious trends could be identified in afterglow for emotion dimension, or for condition (Test versus Retest). But the afterglow effect, indicated by large disagreement in ratings, is clearly present.
Table 4. SD1(t) values at beginning, peak and ending (afterglow) of ensemble ratings and as proportions of sd2.
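The afterglow calculation can be sketched as follows. This is a minimal illustration only. It assumes that ratings for one stimulus-dimension-condition combination are stored as a participants-by-samples array, that SD1(t) is the sample standard deviation across participants at each time point, and that sd2 is the standard deviation of the SD1(t) series. The function name, the array layout and the choice of `ddof=1` are assumptions made for the example.

```python
import numpy as np

def afterglow_summary(ratings):
    """Express initial, peak and final ensemble SD as multiples of sd2.

    ratings: 2-D array-like, shape (participants, samples), for one
    stimulus-dimension-condition combination (hypothetical layout).
    """
    ratings = np.asarray(ratings, dtype=float)
    sd1 = ratings.std(axis=0, ddof=1)  # ensemble SD at each sample, SD1(t)
    sd2 = sd1.std(ddof=1)              # SD of the SD1(t) series (assumed sd2)
    return {
        "initial": sd1[0] / sd2,   # SD1(t1) in sd2 units
        "peak": sd1.max() / sd2,   # largest SD1(t) in sd2 units
        "final": sd1[-1] / sd2,    # SD1(tT), the afterglow sample
    }

# Toy data, not values from the study: three participants start at the
# same point and disagree strongly at the last sample.
toy = [[0, 10, 20, 50],
       [0, 12, 18, -40],
       [0, 11, 25, 0]]
summary = afterglow_summary(toy)
```

In the toy data all participants start from the same value, so the initial ratio is zero, and the final sample is also the peak of the series, which mirrors the pattern reported for Slavonic Dance Arousal.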
Discussion
Initial orientation time
Initial orientation time varies from 2 to 35 seconds, with an overall median of 8 seconds for the stimuli investigated, consistent with Bachorik et al. (2009). This means that the first seconds of response in a continuous response task must be treated with caution. It is likely that the present estimates are low because substantive training was provided, using examples from all quadrants of the emotion space, with worked examples, and asking the participants to perform the task with words and pictures of faces before commencing continuous responses to music. The reliability of continuous responses can therefore be improved by disregarding an opening period of response. Caution is also required when no training is given. Further studies should explore whether there is a training–orientation-time trade off, for if little instruction is provided, perhaps the participant will find their way around the response space or scale, but to reach a point of reliability will take longer (longer initial orientation time) which can later be diagnosed and discarded if necessary.
The present study does not, however, provide a dictum as to how long the initial orientation time will be in all or even similar continuous emotional response paradigms. The variability of the results provides evidence for this. It seems that initial orientation time is more complex than a participant needing a cognitive adjustment and processing time. It could be that some music facilitates consistent emotional response ratings more rapidly, and it might be that certain musical characteristics or musical style affect this time requirement (Bachorik et al., 2009). The overall finding that valence had a shorter initial orientation time than arousal (4 s versus 12 s median times) suggests that valence judgements are things with which we are well experienced – evaluating whether something is positive or negative. But this is a dubious conclusion because arousal may be even more privileged in speed of response (e.g., see Cuthbert, Schupp, Bradley, Birbaumer, & Lang, 2000). A more plausible explanation is that there are clearer and less ambiguous indicators of valence, in particular mode, than of arousal for the examples used. Morning, Pizzicato and Slavonic Dance all produce a clear expression of major tonality at the very opening of the piece, and hence a clear positive valence (Pittenger, 2003; Webster & Weir, 2005), although Morning may produce a slight ambiguity as the melody outlines a descending minor third from the fifth degree until it reaches the end of its descent at the tonic, which is clearly in the major mode. Adagio is in a natural minor mode, and this provides a strong sense of negative valence. Yet, it produces long initial orientation times of 15 s in the test and 25 s in the retest condition (though still shorter than the corresponding arousal response). It is therefore somewhat of a surprise that Morning produced a median initial orientation time of 4 s in the retest (same as the overall median) instead of something extended as in Adagio.
Tempo may provide some explanation as the shortest initial orientation time was for the fastest piece, Slavonic Dance, and the slowest was for Adagio, regardless of emotional response dimension or Test-Retest condition. So, musical characteristics may well have a role to play, and selection of music may therefore have a bearing on the reliability of responses at the start of a continuous emotional response task.
Test-retest reliability
The emotion ratings were examined in different ways to investigate the reliability of continuous emotional ratings. Because there are some moot issues regarding the reporting of test-retest reliability for continuous data, the focus was on comparative reliability, and in particular whether arousal responses are more reliable than, less reliable than, or as reliable as valence responses. We have seen, above, that valence response has some advantage in terms of shorter initial orientation time. However, Test-Retest correlation of first-order differenced data suggested that arousal reliability is higher than valence reliability (average r = 0.79 and 0.54 respectively, undifferenced r = 0.97 and 0.92 respectively). But is this sufficient to conclude that continuous arousal response is more reliable than continuous valence response? Does it mean that when the two are combined onto an emotion-space at right angles that valence responses suffer? Correlations are sensitive to the amount of systematic variance in the variables under comparison. For example, the variability in the longest piece, Adagio – lasting over ten minutes – is relatively small in valence response, certainly compared to the responses to the other pieces (see Figure 2). So, it could be that the lack of variability was responsible for the lowest correlation coefficient, of 0.37, for differenced Adagio Valence response, rather than a lack of reliability per se. To conclude that arousal responses are more reliable than valence responses would therefore necessarily be tentative.
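The contrast between undifferenced and first-order differenced correlations can be illustrated with a short sketch. It assumes two equally sampled series, one per condition; the function name and the synthetic data are hypothetical and do not reproduce the study's responses.

```python
import numpy as np

def retest_correlation(test_series, retest_series, difference=True):
    """Pearson r between two equally sampled series.

    With difference=True the series are first-order differenced
    (x[t] - x[t-1]) before correlating, which removes shared slow trends.
    """
    x = np.asarray(test_series, dtype=float)
    y = np.asarray(retest_series, dtype=float)
    if x.shape != y.shape:
        raise ValueError("series must have equal length")
    if difference:
        x, y = np.diff(x), np.diff(y)
    return float(np.corrcoef(x, y)[0, 1])

# Synthetic illustration: a shared slow trend plus independent noise.
rng = np.random.default_rng(0)
trend = np.linspace(0.0, 100.0, 200)
toy_test = trend + rng.normal(0.0, 1.0, trend.size)
toy_retest = trend + rng.normal(0.0, 1.0, trend.size)

raw_r = retest_correlation(toy_test, toy_retest, difference=False)
diff_r = retest_correlation(toy_test, toy_retest, difference=True)
```

Here the undifferenced coefficient is driven almost entirely by the shared trend, while the differenced coefficient depends on moment-to-moment changes, which in this synthetic case are small relative to the noise. This is one way in which a low differenced coefficient can reflect limited systematic variability rather than poor reliability as such.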
A method that takes into account the within-participant variability was applied as another way of examining test-retest reliability. By comparing samples of relatively high agreement using the second order standard deviation method between Test-Retest conditions, no major differences could be identified, and no difference could be identified between valence and arousal. While further work is needed to determine what an adequate measure and level of reliability is, the present study proposes a technique that provides an alternative, but more so an addendum, to Pearson correlations in those instances where the correlation coefficient may be susceptible to rogue effects. For the method used, the typical level of reliability was quantified at 80%, meaning that participant responses were in relatively high agreement at the same points in time across test and retest conditions (using a τ1 threshold criterion). This reliability varied from 69% to 94% of the time across the various stimulus-dimension combinations.
The final comparative analysis of reliability directly examined the mean of the standard deviations for each stimulus-dimension combination for Test and Retest conditions. The results indicate similar standard deviation (agreement) in the Retest condition. This result does not support a cognitive load explanation, which predicts that, because of practice, the participant does not require as much extraneous cognitive load for operating the interface and can focus more on the experimental task (Paas, Tuovinen, van Merrienboer, & Darabi, 2005; Sweller, 1988; Sweller, 2006). It would be interesting to see whether subsequent retests would further change agreement. If emotional ratings of music change due to multiple listenings (Balkwill & Thompson, 1999; Ritossa & Rickard, 2004), then it is likely that deviation scores will also change. However, it is more likely that this will be the case when it is the felt emotion that is rated, rather than (as in the present study) that which the music appears to be conveying (Grewe et al., 2007b; Salimpoor et al., 2009; Schubert, 2007a).
Afterglow
Participants are meant to return the cursor to the centre of the emotion space when the piece ends. They should do this because they were instructed to do so, but also, logically, if rating the emotion expressed by the music, and the music stops, the music cannot be expressing further emotion. The analysis provided evidence for the presence of an afterglow effect, caused by some participants not returning to the centre at the conclusion of the music. The lowest relative spread of scores at the end of any stimulus-dimension combination was 4.2 times the sd2 of Adagio Valence, and the highest relative spread of scores was 9.64 times the sd2 for Slavonic Dance Arousal which corresponds to the largest ‘outlier’ point for that entire piece. While there were no systematic differences in median afterglow values across dimensions, the values were slightly higher for the median Retest condition. There is, therefore, an unreliable response at the end of the continuous emotion rating task and this presents a message to continuous response researchers that the last portion of a response, like the initial orientation time, should be treated with caution, and possibly deleted from analysis if appropriate.
The explanation of the afterglow effect can only be speculated upon here. Some of the participants seem to become unaware of the requested task – unable to report the emotion that the music is expressing, and confusing the request (expressed emotion) rating with their own feelings. Another (though not independent) explanation is that the serial correlation of the recent responses has bled through to after the end of the piece. If this is the case, it is a further vindication of the differencing technique, which can mathematically reduce the effect of such inertial responses. In non-music literature, a similar effect is referred to as response inhibition – the time it takes to stop performing a task upon presentation of a ‘stop task’, though this inhibition time delay is of a smaller magnitude than the durations found in the present study (see Pessoa, Padmala, Kenzer, & Bauer, 2011; Verbruggen & Logan, 2009). Differencing ratings is an alternative to deleting data in cases where it is desirable to use as much of the time series as possible.
The second order deviation threshold method
The second order deviation threshold method was used as a tool to help induce criteria for sample-by-sample agreement in a time series. It is a flexible method, and while the present study used only mean and standard deviation calculations to determine the good agreement threshold criteria, other statistics should be considered in the future. For example, instead of the mean, a less biased central tendency estimator, such as the median, could be used (see also Grewe et al., 2007a; Grewe et al., 2009a; Korhonen et al., 2006). Further, alternative estimates of the spread of scores should be considered, such as the interquartile distance (Grewe et al., 2009a; Grewe et al., 2007a) and the median absolute distance (Schubert et al., in press). So, for example, for the present study, the median, first order interquartile and second order interquartile could be applied in an analogous manner instead of the mean, first order standard deviation and second order deviation respectively. Some examples of these nonparametric statistics were reported in this study.
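The nonparametric substitution suggested above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: ratings are assumed to be stored as a participants-by-samples array, and the function name is hypothetical. How a threshold would then be built from these quantities is left open, as in the text.

```python
import numpy as np

def robust_agreement(ratings):
    """Median-based analogue of the mean / SD1(t) / sd2 quantities.

    ratings: 2-D array-like, shape (participants, samples) (assumed layout).
    Returns the per-sample median, the per-sample (first-order)
    interquartile range, and the interquartile range of that series
    (second order).
    """
    ratings = np.asarray(ratings, dtype=float)
    center = np.median(ratings, axis=0)
    q75, q25 = np.percentile(ratings, [75, 25], axis=0)
    iqr1 = q75 - q25                              # first-order IQR per sample
    q75_2, q25_2 = np.percentile(iqr1, [75, 25])
    iqr2 = q75_2 - q25_2                          # second-order IQR
    return center, iqr1, iqr2

# Toy data, not values from the study: five participants, three samples.
toy = [[0, 10, 20],
       [1, 10, 40],
       [2, 10, 60],
       [3, 10, 80],
       [4, 10, 100]]
center, iqr1, iqr2 = robust_agreement(toy)
```

With NumPy's default linear interpolation of percentiles, the toy data give medians of 2, 10 and 60, first-order interquartile ranges of 2, 0 and 40, and a second-order interquartile range of 20.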
However, the second order deviation threshold method does have limitations. It is a relative measure of reliability. Since SD1(t) data are presented as a ratio of sd2, the method is identical in principle to effect size (Cohen, 1992), and is appropriate for comparing conditions, in the present case arousal and valence responses. In the present study, the technique, used along with other approaches, provides a tool for diagnosing initial orientation time, reliability and afterglow effects in continuous emotional responses to music. It demonstrates a potentially shorter initial orientation time for valence than for arousal, no difference (or equal reliability) across Test-Retest conditions for both arousal and valence responses, and the presence of an unreliable or misleading ‘afterglow’ at the conclusion of the piece.
Conclusion
Test-retest reliability of ratings of four pieces of non-vocal, orchestral music, made by fourteen participants on two occasions separated by a long period of time (six to twelve months), suggests that there is good agreement in identifying emotion expressed by music. The method of analysis employed confirmed and identified the presence of some important characteristics of time-series data collected from emotional response tasks to music. While it can be concluded that test-retest responses are generally reliable – the current approach reporting agreement across conditions about 80% of the time – the opening moments of the music and the end of the music in general produce unreliable results. Responses to the start of a piece of music are marked by a period of orientation time that, in the present study, ranged from a couple of seconds to over thirty seconds. It was longer for arousal response than for valence response, and was typically eight seconds long. The conclusion drawn is that the initial emotional responses need to be treated with caution and perhaps eliminated from some analyses. It is also speculated that training may reduce the orientation time, and that repeated exposures may alter emotional responses, though these issues were not explicitly manipulated in the present study.
Researchers are aware of the unreliability of the responses to the opening section of a piece of music – for example, Grewe et al. (2007a) omitted the first 10 seconds of their response data. However, the present study provides a technique for diagnosing the duration of this orientation time. This will increase the amount of ‘useful’ response time that can be analysed, and highlights the variability of initial orientation time, which seems to be a function of musical features, with tempo being speculated as an important factor – slow pieces used in the present study had longer orientation times, and fast pieces had shorter ones.
Afterglow effects were also quantified in this study, and indicate the potential lack of reliability at the end of a task involving continuous rating of the emotion expressed by music. Some of the largest deviation scores were identified in the responses just after the music ended, despite the instruction (returning to the neutral point) which should have produced very small ensemble rating deviations.
Continuous response methods offer exciting insights into how emotion in music works. The present study demonstrated how variations in responses can be identified visually, and can therefore be adapted to more sophisticated problems such as modelling the response as a function of other variables (e.g., Korhonen et al., 2006; Schubert, 2004) in the future. Further work is needed to investigate the nature of the various factors that influence reliability, and a wider range of musical styles (Bachorik et al., 2009) will determine whether the results are limited to Western orchestral music from part of the last two centuries. As continuous response techniques are adopted by emotion in music researchers, the implications of issues such as reliability at the beginning, middle and end of the time dependent data sets also need to be considered if the results of the data sets and subsequent conclusions are to be optimized.
Acknowledgements
This research was conducted with support from the Australian Research Council (DP0452290 and DP0986153). The author is grateful for the comments of two anonymous reviewers, and for the continuing support of Visiting Associate Eric Sowey and Professor William Dunsmuir.
