Emotional ratings of words are in high demand because they are used in at least four lines of research. The first of these lines concerns research on the emotions themselves: the ways in which they are produced and perceived, their internal structure, and the consequences that they have for human behavior. For instance, Verona, Sprague, and Sadeh (2012) used emotionally neutral and negative words in an experiment comparing the responses of offenders without a personality disorder to those of offenders with an antisocial personality disorder who either did or did not have additional psychopathic traits.

The second line of research deals with the impact that emotional features have on the processing and memory of words. Kousta, Vinson, and Vigliocco (2009) found that participants responded faster to positive and negative words than to neutral words in a lexical-decision experiment, a finding later replicated by Scott, O’Donnell, and Sereno (2012) in sentence reading. According to Kousta, Vigliocco, Vinson, Andrews, and Del Campo (2011), emotion is particularly important in the semantic representations of abstract words. In other research, Fraga, Piñeiro, Acuña-Fariña, Redondo, and García-Orza (2012) reported that emotional words are more likely to be used as attachment sites for relative clauses in sentences such as “Someone shot the servant of the actress who. . . .”

A third approach uses emotional ratings of words to estimate the sentiments expressed by entire messages or texts. Leveau, Jhean-Larose, Denhière, and Nguyen (2012), for instance, wrote a computer program to estimate the valence and arousal evoked by texts on the basis of word measures (see also Liu, 2012).

Finally, emotional ratings of words are used to automatically estimate the emotional values of new words by comparing them to those of validated words. Bestgen and Vincze (2012) gauged the affective values of 17,350 words by using the rated values of words that were semantically related.

So far, nearly all studies have been based on Bradley and Lang’s (1999) Affective Norms for English Words (ANEW) or on translated versions (for exceptions, see Kloumann, Danforth, Harris, Bliss, & Dodds, 2012; Mohammad & Turney, 2010). These norms include ratings for 1,034 words. Three types of ratings were carried out, in line with Osgood, Suci, and Tannenbaum’s (1957) theory of emotions. The first, and most important, type of ratings concerns the valence (or pleasantness) of the emotions invoked by a word, going from unhappy to happy. The second addresses the degree of arousal evoked by a word, and the third dimension refers to the dominance/power of the word—the extent to which the word denotes something that is weak/submissive or strong/dominant.

The number of words covered by the ANEW norms appeared sufficient for use in small-scale factorial experiments. In these experiments, a limited number of stimuli would be selected that varied on one dimension (e.g., valence) and were matched on other variables (e.g., arousal, word frequency, and word length). However, the number of words in this set is prohibitively small for the large-scale megastudies that are currently emerging in psycholinguistics. In these studies (e.g., Balota et al., 2007; Ferrand et al., 2010; Keuleers, Brysbaert, & New, 2010; Keuleers, Lacey, Rastle, & Brysbaert, 2012), regression analyses of thousands of words are used to disentangle the influences on word recognition. The ANEW norms are also limited as input for computer algorithms that gauge the sentiment of a message/text or the emotional values of nonrated words.

Given the ease with which word norms can be collected nowadays, we decided to collect affective ratings for a majority of the well-known English content words (a total of 13,915). Because it would be expected that the emotional values would generalize to inflected forms (e.g., sings, sang, sung, and singing for the verb lemma sing), we only included lemmas (the base forms of words—i.e., the ones used as entries in dictionaries). Our sample of words (see below for the selection criteria) substantially covers the word stock of the English language and forms a solid foundation from which to automatically derive the values of the remaining words (Bestgen & Vincze, 2012).

Method

Stimuli

The words included in our stimulus set were compiled from three sources: Bradley and Lang’s (1999) ANEW database, Van Overschelde, Rawson, and Dunlosky’s (2004) category norms, and the SUBTLEX-US corpus (Brysbaert & New, 2009). Our final set included 1,029 of the 1,034 words from ANEW (five were lost due to programmatic error) and 1,060 of the participant-generated responses to 60 of the 70 category names included in the category norm study (we did not include a few categories, such as units of time and distance or types of fish). The remaining words were selected from the list of 30,000 lemmas for which Kuperman, Stadthagen-Gonzalez, and Brysbaert (2012) collected age-of-acquisition ratings. This list contains the content lemmas (nouns, verbs, and adjectives) from the 50-million-token SUBTLEX-US subtitle corpus. We only selected the highest-frequency words known by 70 % or more of the participants in Kuperman et al., given that affective ratings are less valid/useful for words that are not known to most participants. Our final set included 13,915 words, of which 22.5 % are most often used as adjectives (Brysbaert, New, & Keuleers, 2012), 63.5 % as nouns, 12.6 % as verbs, and 1.4 % as other or unspecified parts of speech. The mean word frequency of the set was 1,056 (SD = 8,464, range = 1 to 314,232, median = 87) in the 50-million-token SUBTLEX-US corpus; 152 words, or 1 %, had no frequency data. For each word in our set, we collected ratings on three dimensions using a 9-point scale.

The stimuli were distributed over 43 lists containing 346 to 350 words each. Each list consisted of 10 calibrator words, 40 control words from ANEW, and a randomized selection of non-ANEW words. The calibrator words were drawn from ANEW and were chosen separately for each of the three dimensions, with the goal of giving participants a sense of the entire range of the stimuli that they would encounter (Footnote 1). Participants always saw these calibrator words first. The remaining ANEW words were divided into sets of 40 and served as controls for the estimation of correlations between our data and the ANEW norms. This meant that a selection of these words appeared in more than one list and that the lists used for each of the three dimensions were mostly, but not completely, identical. The control words and the non-ANEW words were randomly mixed together in each list. Once lists were created, the words in each one were always presented in a fixed order following the calibrator words.
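The list layout described above (calibrators first, then controls and new words mixed in an order that is fixed once the list exists) can be sketched as follows. This is only an illustration under assumptions: the function name, the seeding scheme, and the word labels are hypothetical and are not taken from the study materials.

```python
import random

def build_list(calibrators, controls, new_words, seed=0):
    """Assemble one rating list (hypothetical helper).

    Calibrator words come first, in the order given. Control (ANEW) words
    and non-ANEW words are mixed together randomly. Using a fixed seed means
    every participant who receives this list sees the same order.
    """
    rng = random.Random(seed)
    body = list(controls) + list(new_words)
    rng.shuffle(body)
    return list(calibrators) + body

# Example with placeholder labels: 10 calibrators, 40 controls, 300 new words.
calibrators = [f"cal{i}" for i in range(10)]
controls = [f"anew{i}" for i in range(40)]
new_words = [f"word{i}" for i in range(300)]
rating_list = build_list(calibrators, controls, new_words, seed=1)
```

Seeding per list is one simple way to obtain a random but fixed presentation order; the study may have fixed the order differently.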

Data collection

Participants were recruited via the Amazon Mechanical Turk crowdsourcing website. Responders were restricted to those who self-identified as being current residents of the US and who completed any given list only once. This completion of a single list by a given participant will henceforth be referred to as an assignment. Each assignment involved rating words on a single dimension only, in contrast to the ANEW study, for which participants rated each word on all three dimensions. The instructions given were minor variations on the instructions in the ANEW project, and are given below, with the respective changes to the wording for the separate dimensions indicated in square brackets.

You are invited to take part in the study that is investigating emotion, and concerns how people respond to different types of words. You will use a scale to rate how you felt while reading each word. There will be approximately 350 words. The scale ranges from 1 (happy [excited; controlled]) to 9 (unhappy [calm; in control]). At one extreme of this scale, you are happy, pleased, satisfied, contented, hopeful [stimulated, excited, frenzied, jittery, wide-awake, or aroused; controlled, influenced, cared-for, awed, submissive, or guided]. When you feel completely happy [aroused; controlled] you should indicate this by choosing rating 1. The other end of the scale is when you feel completely unhappy, annoyed, unsatisfied, melancholic, despaired, or bored [relaxed, calm, sluggish, dull, sleepy, or unaroused; in control, influential, important, dominant, autonomous, or controlling]. You can indicate feeling completely unhappy [calm; in control] by selecting 9. The numbers also allow you to describe intermediate feelings of pleasure [calmness/arousal; in/under control], by selecting any of the other feelings. If you feel completely neutral, neither happy nor sad [not excited nor at all calm; neither in control nor controlled], select the middle of the scale (rating 5).

Please work at a rapid pace and don’t spend too much time thinking about each word. Rather, make your ratings based on your first and immediate reaction as you read each word.

On average, assignments were completed in approximately 14 min. Participants received 75 cents per completed assignment. After reading an informational consent statement and the instructions, participants were asked to indicate their age, gender, first language(s), country/state resided in most between birth and age 7, and educational level. Subsequently, they were reminded of the scale anchors and presented with a scrollable page in which all words in the list were shown to the left of nine numbered radio buttons. Although we did not incorporate the Self-Assessment Manikins (SAM) that were used in the ANEW study, we did anchor our scales in the same direction, with valence ranging from happy to unhappy, arousal from excited to calm, and dominance from controlled to in control. In the Results and Discussion section, we show that our numerical ratings correlated highly with the SAM ratings from ANEW, demonstrating that the methods are roughly equivalent. Once finished, participants clicked “Submit” to complete the study.

Lists were initially presented to 20 respondents each. However, missing values due to subsequent exclusion criteria resulted in some words having fewer than 18 valid ratings. Several of the lists were reposted until the vast majority of the words had reached at least this threshold. Data collection began on March 14, 2012, and was completed May 30, 2012.

Results and discussion

Data trimming

Altogether, 1,085,998 ratings were collected across all three dimensions. Around 3 % of the data were removed due to missing responses, lack of variability in responses (i.e., providing the same rating for all words in the list), or the completion of fewer than 100 ratings per assignment. The valence and arousal ratings were reversed post hoc to maintain a more intuitive low-to-high scale (e.g., sad to happy rather than happy to sad) across all three dimensions. Means and standard deviations were calculated for each word. Ratings were reversed in assignments in which a participant’s ratings correlated negatively with the mean ratings of the words (9 %). This was done on the basis of both empirical evidence that higher numbers intuitively go with positive anchors (Rammstedt & Krebs, 2007) and an examination of these participants’ responses, which revealed unintuitive answers (e.g., indicating that negative words such as “jail” made them very happy). Any remaining assignments whose ratings correlated with the mean ratings per item at less than .10 were removed, and the means and standard deviations were recalculated. The final data set consisted of 303,539 observations for valence (95 % of the original data pool), 339,323 observations for arousal (89 % of the original data pool), and 281,735 observations for dominance (74 % of the original data pool). A total of 1,827 responders contributed to this final data set, with 362 of them completing assignments for two or more dimensions. A total of 144 participants completed two or more assignments within a single dimension.
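The trimming steps described above can be sketched in Python as follows. This is a minimal illustration, not the script used for the study. The data layout (a dict mapping assignment IDs to word-to-rating dicts), the function names, and the choice to recompute word means before applying the .10 cutoff are all assumptions. The thresholds (fewer than 100 ratings, correlation below .10) follow the text.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either vector has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / sqrt(sxx * syy)

def word_means(assignments):
    """Mean rating per word across all assignments."""
    pooled = {}
    for ratings in assignments.values():
        for word, rating in ratings.items():
            pooled.setdefault(word, []).append(rating)
    return {w: sum(rs) / len(rs) for w, rs in pooled.items()}

def corr_with_means(ratings, means):
    words = list(ratings)
    return pearson([ratings[w] for w in words], [means[w] for w in words])

def trim(assignments, min_ratings=100, min_r=0.10, reverse_scale=True):
    """Hypothetical trimming pipeline for one dimension (9-point scale)."""
    # 1. Drop assignments with too few ratings or with no variability.
    kept = {a: dict(r) for a, r in assignments.items()
            if len(r) >= min_ratings and len(set(r.values())) > 1}
    # 2. Reverse the scale (valence and arousal only) so that 9 is the high end.
    if reverse_scale:
        kept = {a: {w: 10 - x for w, x in r.items()} for a, r in kept.items()}
    # 3. Flip assignments that correlate negatively with the word means.
    means = word_means(kept)
    for a, r in list(kept.items()):
        if corr_with_means(r, means) < 0:
            kept[a] = {w: 10 - x for w, x in r.items()}
    # 4. Recompute the means and drop assignments correlating below min_r.
    means = word_means(kept)
    final = {a: r for a, r in kept.items()
             if corr_with_means(r, means) >= min_r}
    return final, word_means(final)
```

For dominance, which was not reversed, the same function would be called with `reverse_scale=False`.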

For valence, 51 words received fewer than 18 (but more than 15) valid ratings. For arousal, 128 words had a total number of ratings in that range. For dominance, 564 words had a total of either 16 or 17 ratings, and 17 words had 14 or 15 ratings each. For all three dimensions, more than 87 % of the words had between 18 and 30 ratings per word. A total of 50 words in each dimension received more than 70 ratings each, due to the doubling up of ANEW words and the rerunning of lists. To illustrate how our data enrich the set of words available in ANEW, Table 1 provides examples of words that are not included in the ANEW list and that show very high or very low ratings in one of the three dimensions.

Table 1 Words at the extremes of each dimension that were not included in ANEW

Demographics

Of the 1,827 valid responders, approximately 60 % were female in all three cases (419 valence, 448 arousal, and 505 dominance). Their ages ranged from 16 to 87 years, with 11 % being 20 years old or younger; 45 % from 21 to 30; 21 % from 31 to 40; 11 % from 41 to 49; and 12 % age 50 or older. Of the participants, 24 (3.3 %), 32 (4.3 %), and 23 (2.7 %) for the valence, arousal, and dominance dimensions, respectively, reported a native language other than English, while 10 (1.4 %), 12 (1.6 %), and 12 (1.4 %) participants, respectively, reported more than one native language, including English. Table 2 shows the numbers of participants at each of the seven possible education levels. Most had some college or a bachelor’s degree.

Table 2 Reported education levels within each dimension

Descriptive statistics

Table 3 reports descriptive statistics for the three distributions of ratings. The distributions of both valence and dominance ratings are negatively skewed (G1 = −.28 and −.23, respectively), with 55 % of the words rated above the median of the rating scale for both dimensions (see Fig. 1). The Mann–Whitney one-sample median test indicated that the medians of both the valence and dominance distributions were significantly higher than rating 5, which is the median of the scales (both ps < .001). The tendency for more words to make people feel happy and in control is in line with numerous earlier findings of positivity biases in English and other languages (see Augustine, Mehl, & Larsen, 2011, and Kloumann et al., 2012). The positivity bias (i.e., the prevalence of positive word types in English books, Twitter messages, music lyrics, and other genres of texts) is argued to reflect the preference of humankind for pro-social and benevolent communication. Arousal, on the other hand, is positively skewed (G1 = .47), meaning that only a relatively small proportion of words (20 % above a rating of 5) made people feel excited.
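For readers who want to reproduce these summary statistics from the word means, the sketch below computes a skewness coefficient and the share of words above the scale midpoint. It assumes that the skewness statistic reported above is the adjusted Fisher-Pearson sample skewness conventionally written G1; the text does not state which estimator was used, and the function names are hypothetical.

```python
def sample_skewness_g1(values):
    """Adjusted Fisher-Pearson skewness: G1 = sqrt(n(n-1)) / (n-2) * m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    g1 = m3 / m2 ** 1.5
    return (n * (n - 1)) ** 0.5 / (n - 2) * g1

def share_above(values, midpoint=5.0):
    """Proportion of values strictly above the scale midpoint."""
    return sum(v > midpoint for v in values) / len(values)
```

A negative value indicates a longer tail toward low ratings (as reported for valence and dominance), and a positive value a longer tail toward high ratings (as reported for arousal).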

Table 3 Descriptive statistics for the distribution of each dimension, including the number of participants (N), number of observations, average mean, and average SD
Fig. 1 Distributions of valence (green), arousal (red), and dominance (blue) ratings. Dotted lines represent the medians of the respective distributions

Ratings of valence were relatively consistent across participants, while arousal and dominance were much more variable. This is indicated by the difference between the average standard deviations of the dimensions: 1.68 for valence, but 2.30 and 2.16 for arousal and dominance, respectively. In addition, the split-half reliabilities were .914 for valence, .689 for arousal, and .770 for dominance; see below for other examples of a higher variability of dominance and arousal ratings. Figures 2a–c show, for the three emotional dimensions, the means of the ratings for each word plotted against their standard deviations, with each scatterplot’s lowess smoother line demonstrating the overall trend in the data (red solid lines). For illustrative purposes, each plot is supplied with selected examples of words that are substantially more or less variable than other words with the given mean rating. Swear words, taboo words, and sexual terms account for a disproportionately large number of words that elicit more variable ratings of valence and arousal than would be expected given the words’ mean ratings (shown as words in blue above the red lowess line in Fig. 2a–c), in line with Kloumann et al. (2012). Below we will demonstrate that the greater variability for such words may be due to gender differences in the norms.
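One common way to obtain a split-half reliability for word norms is sketched below. The text does not describe the exact procedure, so the details here are assumptions: each word's ratings are randomly split into two halves, the half means are correlated across words, and a Spearman-Brown correction is optional. All names are hypothetical.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def split_half_reliability(ratings_by_word, seed=0, spearman_brown=False):
    """Correlate the mean ratings from two random halves of the raters of each word."""
    rng = random.Random(seed)
    half_a, half_b = [], []
    for ratings in ratings_by_word.values():
        if len(ratings) < 2:
            continue  # a word needs at least one rating in each half
        shuffled = list(ratings)
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        half_a.append(sum(shuffled[:mid]) / mid)
        half_b.append(sum(shuffled[mid:]) / (len(shuffled) - mid))
    r = pearson(half_a, half_b)
    # Spearman-Brown estimates the reliability of the full set of ratings.
    return 2 * r / (1 + r) if spearman_brown else r
```

Because the split is random, the estimate varies somewhat with the seed; averaging over several seeds gives a more stable value.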

For valence, the scatterplot in Fig. 2a (top left) is symmetrical about the median, with relatively positive or negative words associated with smaller variability in the ratings across participants, as compared to valence-neutral words (see Moors et al., in press, for a similar finding in Dutch). The same holds for the pattern observed in the dominance ratings, Fig. 2c (bottom left). The plot of valence strength (absolute difference between the valence rating and the median of valence ratings; Fig. 2d) corroborates the tendency of more extreme (positive or negative) words to be less variable in their ratings than neutral ones. In contrast, for arousal in Fig. 2b (top right), words that make people feel calm generally elicit more consistent ratings than do those that make people feel excited. To sum up, in terms of the variability of ratings, valence and dominance pattern together and are best considered in terms of their magnitude (how strong is the feeling) rather than their polarity (sad vs. happy, or controlled by vs. in control); polarity, however, determines variability in the arousal ratings.

Fig. 2 Standard deviations of ratings for valence (a, top left), arousal (b, top right), dominance (c, bottom left), and valence strength (d, bottom right) plotted against the respective mean ratings. Panels a–c also provide examples of words with disproportionately large and small standard deviations, given their means

Correlations between dimensions

We found the typical U-shaped relationship between arousal and valence (see Fig. 3a; Bradley & Lang, 1999; Redondo, Fraga, Padrón, & Comesaña, 2007; Soares, Comesaña, Pinheiro, Simões, & Frade, 2012): Words that are very positive or very negative are more arousing than those that are neutral. This is corroborated by the positive correlation between valence and arousal for positive words (mean valence rating > 6; r = .273, p < .001) and the negative correlation between valence and arousal for negative words (mean valence rating < 4; r = −.293, p < .001). The relationship between arousal and dominance is also U-shaped (see Fig. 3b), as corroborated by the positive correlation between dominance and arousal for high-rated dominance words (mean rating > 6; r = .139, p < .001) and the negative correlation between dominance and arousal for low-rated dominance words (mean rating < 4; r = −.193, p < .001). The relationship between valence and dominance is linear, with words that make people feel happier also making them feel more in control (see Fig. 3c). Table 4 shows that a quadratic relationship between arousal and valence and between arousal and dominance explains more of the variance than does a linear relationship. However, this does not rule out the possibility that the high and low levels of these associations might be explained better by a regression with a break point at the median of the scale (see Fig. 3). The relationship between dominance and valence, however, is fitted better by a linear model.
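The subset correlations used above to corroborate the U shape can be computed as in the following sketch. It is an illustration only: the function names and the input format (two parallel lists of word means) are assumptions, while the cutoffs of 4 and 6 follow the text.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def subset_correlations(predictor, arousal, low=4.0, high=6.0):
    """Correlate arousal with the predictor separately for low and high words.

    `predictor` holds mean valence (or dominance) per word. A U-shaped
    relationship shows up as a negative correlation among words below `low`
    and a positive correlation among words above `high`.
    """
    low_pairs = [(p, a) for p, a in zip(predictor, arousal) if p < low]
    high_pairs = [(p, a) for p, a in zip(predictor, arousal) if p > high]
    r_low = pearson(*zip(*low_pairs))
    r_high = pearson(*zip(*high_pairs))
    return r_low, r_high
```

Each subset needs at least two words with differing values; with the full norms, both subsets contain thousands of words.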

Fig. 3 Scatterplots of dimensions (a, arousal vs. valence; b, arousal vs. dominance; c, dominance vs. valence), along with lowess lines (in red) showing the functional relationships, and regression lines for arousal as predicted by high (in green) and low (in purple) valence and dominance. Sample words have also been included

Table 4 Pearson’s correlations, linear and quadratic coefficients, and the quadratic R² for each dimension

The strength of the correlation between dominance and valence casts doubt on the claim that the three dimensions under consideration here are genuinely orthogonal affective states. This assumption was the basis of the original ANEW study (Bradley & Lang, 1999), stemming from original factor analyses done by Osgood, Suci, and Tannenbaum (1957). Future research will have to demonstrate that dominance explains unique variance over and above valence in language-processing behavior. The fact that extreme values of valence and dominance are more arousing points again to the utility of considering valence/dominance strength (i.e., how different a word is from neutral) rather than polarity as the explanatory variable. We return to this point below.

Reliability

We compared our ratings with several smaller sets of ratings that had been collected previously by other researchers, including the ANEW set from which we drew our control words. The correlations are listed in Table 5.

Table 5 Correlations of present ratings with similar studies across languages

Valence appears to generalize very well across studies and languages, as evidenced by high correlations. Both arousal and dominance show more variability across languages and studies, as reflected in the lower correlations. Note that these studies themselves (those that reported the information, i.e., c, d, and e) also found a lower correlation between their arousal and dominance ratings and the arousal and dominance ratings reported in other studies (arousal range = .65 to .75; dominance range = .72 to .73). Importantly, however, cross-linguistic correlations were stronger (the range of Pearson’s r for arousal was .575–.759) than those between gender, age, and education groups within our study (the range of Pearson’s r was .467–.516; see Table 8 below). This observation supports the validity of using the emotional ratings of English glosses for words in a language that does not have an extensive set of ratings at the researcher’s disposal. This seems to be more the case for valence and dominance than for arousal.

Correlations with lexical properties

As is known for other subjective ratings of lexical properties (cf. Baayen, Feldman, & Schreuder, 2006), judgments of the emotional impact of a word are likely to be affected by other aspects of the word’s meaning. Table 6 reports correlations of valence, arousal, and dominance with a range of available semantic variables. In the remainder of the article, words, rather than the trial-level data, were chosen as units of the correlational analyses.

Table 6 Correlations between emotional dimensions and semantic variables reported in prior studies [degrees of freedom are based on the numbers of data points reported as N (Overlap)]

Most of the correlations that the emotional ratings show with other semantic properties are weak to moderate (Cohen, 1992), with the exception of correlations with variables that directly tap into emotional states (h and i in Table 6). Specifically, words that make people happy are easier to picture [r(5123) = .161, p < .001] and more concrete [r(1565) = .105, p < .001], familiar [r(2904) = .206, p < .001], context rich [r(316) = .196, p < .001], and easy to interact with [r(1396) = .203, p < .001], are of high frequency [r(13763) = .182, p < .001], and are learned at an early age [r(13707) = −.233, p < .001]. They are also associated with low pain [r(501) = −.456, p < .001], intense smell [r(501) = .139, p < .01], vivid color [r(1281) = .322, p < .001], pleasant taste [r(501) = .309, p < .001], quiet sounds [r(501) = −.176, p < .001], and stillness [r(501) = −.113, p < .05]. Virtually all of these properties are also associated with words that make people feel in control; that is, they correlate in the same way with dominance ratings.

Words that make people feel excited are more ambiguous [r(1565) = −.258, p < .001], unfamiliar [r(501) = −.193, p < .001], context impoverished [r(316) = −.147, p < .01], and difficult to interact with [r(1396) = −.143, p < .001]. They are also associated with strong general sensory experience [r(5005) = .228, p < .001], specifically with high pain [r(501) = .579, p < .001], unpleasant taste [r(501) = −.102, p < .05], intense sounds [r(501) = .407, p < .001], motion [r(1281) = .335, p < .001], and an inability to be grasped [r(501) = −.121, p < .01].

As correlations do not reveal the form of the functional relationships, Fig. 4 below zooms in on functional relationships between the three emotional dimensions and selected semantic properties of interest.

Fig. 4 Relationships between the three dimensions and age of acquisition, word frequency, imageability, and sensory experience ratings, presented as scatterplot smoother lowess trend lines

The top left panel of Fig. 4 reveals that early words are maximally positive, strong, and calm. Words become more negative and weak (controlled by) on average as the age of acquisition increases. The peak of arousal is reached in the words learned around the age of 10, while later-acquired words are less exciting. It is tempting to interpret these results as an average developmental timeline of vocabulary acquisition in North American children, with (a) earliest happy and calm words learned in a risk-averse environment protecting a child from negativity and excitement, and (b) excitable words like sexual terms, taboo words, and swear words learned in early school age. Yet it is more likely that the age-of-acquisition patterns of emotional words are at least partly due to how often they occur in English, and thus how likely children are to encounter and learn them early. The top right of Fig. 4 demonstrates that the more frequent a word is, the happier, stronger, and calmer it tends to be. The observed linear relationship between log frequency of occurrence and valence is reasonably strong: The Pearson’s correlation coefficient is .18, and the increase in valence between the least and most frequent words is on the order of two points on the 9-point scale. This corroborates the finding of Garcia, Garas, and Schweitzer (2012) and runs counter to the claim of Kloumann et al. (2012) that the positivity bias in English words is only observed in word types (there are more positive than negative words) and that the correlations between frequency and valence, if any, are corpus-specific and small. The discrepancy may be due to the much broader range of frequency that we consider here, with 14,000 words from the top of the frequency list rather than 5,000 words in each of the corpora considered by Kloumann et al. We leave the verification of the positivity bias over a broader frequency range to further research.

Only highly imageable words are emotionally colored (Fig. 4, bottom left): As imageability increases from rating 5 on the 7-point scale, words become more positive and strong (in control). Again, arousal is distinct from this pattern: Words that are hardly imageable at all or very imageable are calm, while those in the middle of the imageability range increase excitement.

The increasing strength of a sensory experience (Fig. 4, bottom right) varies strongly with arousal: The more tangible the word is, the more exciting it is. This suggests that abstract notions are less powerful in agitating human readers than are material objects. The functional relationship with valence is only observed in the top half of the sensory experience range: More tangible words induce increasingly positive emotions. No reliable relationship is observed between sensory experience ratings and dominance.

Interactions between demographics and ratings

Participants were naturally divided into two genders. In addition, we divided them into two age ranges using the median split: younger (less than 30) and older (30 or greater). We also dichotomized education level into higher (those who had an associate’s degree or greater) and lower (some college or less). All three dimensions showed slightly but significantly higher average ratings for younger versus older and for lower education versus higher education. Also, males gave slightly higher ratings in all dimensions than did females. Separate independent t tests showed that this difference was significant for valence and arousal, but not for dominance. The means, standard deviations, and independent t test significance levels of each group division are listed in Table 7.

Table 7 Group differences in emotional dimensions

Table 8 reports correlations between groups of participants and demonstrates substantial variability in the ratings that they provided: As with the overall data in Table 5, arousal and dominance elicited less agreement in judgments than did valence.

Table 8 Correlations between groups

We ran a series of multiple regressions looking at age, gender, and education (all dichotomized as described above) as predictors. All main effects were significant at p < .001, and each variable made a unique contribution to the variance in the collected ratings. In addition, most of the two- and three-way interactions for all three dimensions were significant, likely due to the large number of data points available. However, the actual ranges of the effects tended to be small. One exception was the interaction between age and education level for all three dimensions (see Fig. 5). For valence and arousal, highly educated people rated words similarly, regardless of age. For those with less education, age strongly affected ratings, with the younger group providing higher ratings, on average, than did the older. For dominance, the opposite pattern held: Age affected those in the higher education group, with older participants providing higher ratings than younger ones, but age did not have an effect in the lower education group.

Fig. 5 Interactions between dichotomized education and age levels for all three dimensions. All interactions are significant at p < .001

Gender differences

In what follows, we concentrate on gender differences. Effects of well-established lexical properties on emotion norms varied by gender. Figure 6 presents interactions of gender with frequency of occurrence and age of acquisition as predictors of emotional ratings. All interactions reached significance in multiple regression models, with each set of ratings treated separately as a dependent variable, all ps < .01.

Fig. 6 Interactions of gender with frequency (left) and age of acquisition (AoA, right) as predictors of mean ratings of valence (top), arousal (middle), and dominance (bottom). Interactions are presented with gender-specific lowess trend lines

The interactions revealed that female raters provided more extreme negative/weak ratings for the lowest-frequency words, and more extreme positive/strong ratings for higher-frequency words, yielding a broader range of values for both valence and dominance. The same holds for the more extreme ratings given by females to earliest- and latest-learned words, as compared to males.

Quite the opposite pattern was observed in the ratings of arousal (Fig. 6, middle row). Female raters showed a weak relationship between either frequency or age of acquisition and arousal, with slightly higher arousal for words in the higher-frequency band and in the mid-range of age of acquisition. Conversely, male raters showed a strong tendency to find higher-frequency and earlier-learned words less exciting than relatively late-learned and infrequent words.

Variability in ratings also differed by gender (see Fig. 7). Male raters disagreed increasingly on all ratings as word frequency increased, whereas the variance in ratings by female participants was increasingly attenuated as word frequency increased.

Fig. 7 Interactions of gender with frequency as a predictor of the standard deviations of ratings of valence (top left), arousal (top right), and dominance (bottom left). Interactions are presented with gender-specific lowess trend lines

While pinning down the origin of these differences will be an issue for further investigation, here we note the necessity for research into emotion words to take into account these interactions as potential sources of systematic error.

Semantic categories

An interesting aspect of emotional ratings is their use to quantify attitudes and opinions toward physical, psychological, and social phenomena, either in the population at large or in specific target groups. We showcase here emotional ratings of words in the semantic categories of disease (Fig. 8) and occupation (Fig. 9), based on Van Overschelde et al.’s (2004) category norms, with occasional additions of semantically similar words. As Fig. 8 suggests, all diseases are rated as words evoking negative feelings, high arousal, and feelings of being controlled; that is, all ratings were below the median of valence/dominance and above the median of arousal in the entire data set (shown as a dotted line). Sexually transmitted diseases were judged as being among the most negative and the most anxiety-provoking entries in the subset. This is generally in line with surveys of attitudes that list sexually transmitted diseases as being among the most stigmatized medical conditions (e.g., Brems, Johnson, Warner, & Roberts, 2010). The most feared medical conditions, namely cancer, Alzheimer’s, heart disease, and stroke (listed by decreasing percentages of respondents who feared them; MetLife Foundation, 2011; YouGov, 2011), are also among the most negative, the least controllable, and the most anxiety-provoking diseases.

Fig. 8. Ratings of words denoting disease. Dotted lines represent the median ratings of the respective emotional dimensions across the entire data set.

Fig. 9. Ratings of words denoting occupations. Dotted lines represent the median ratings of the respective emotional dimensions across the entire data set.

Valence ratings of occupations revealed that the best-paying professions in the list were judged as being the most negative, below the median in the overall data set: Compare “lawyer,” “dentist,” and “manager.” The correlation between average income, as reported by the Bureau of Labor Statistics (2011), and mean valence is indeed negative, but it does not reach significance (r = −.167, p = .434), possibly due to low statistical power (df = 22). Some of the contrasts might prove interesting to social scientists. For example, both “police officer” and “firefighter” are rated as highly arousing, but “police officer” is viewed negatively, whereas “firefighter” is viewed positively. In contrast, “librarian” is a positive but completely unarousing occupation term.
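The significance test for the income and valence correlation can be retraced from the two reported numbers alone, using the standard conversion of a Pearson r to a t statistic with df = n − 2. The following is a minimal sketch in Python, not the analysis code used for this article; the function names are hypothetical, and the p value is obtained by plain numerical integration of the t density rather than by a statistics library.

```python
import math

def t_from_r(r, df):
    """t statistic for a Pearson correlation r with df = n - 2."""
    return r * math.sqrt(df) / math.sqrt(1.0 - r * r)

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p value via Simpson integration of the density on [0, |t|]."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2.0 * total * h / 3.0  # P(-|t| < T < |t|)
    return 1.0 - central_mass

# Values reported in the text: r = -.167, df = 22
t_stat = t_from_r(-0.167, 22)
p_value = two_tailed_p(t_stat, 22)
print(f"t(22) = {t_stat:.3f}, p = {p_value:.3f}")
```

Because r is reported to three decimals, the recomputed p value should be expected to land near, but not necessarily exactly on, the reported p = .434.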

Emotional ratings are also a useful tool for studying gender differences in attitudes and beliefs. Figure 10 reports gender differences in ratings of terms denoting weaponry, with the difference between the ratings of female and male responders on the y-axis. The upper parts of the plots in Fig. 10 show words that were given higher valence, arousal, or dominance ratings by female responders; dotted lines represent the no-difference line. Words in blue are items for which the difference in ratings between gender groups reached significance at the p < .01 level in two-tailed independent t tests.

Fig. 10. Gender differences in ratings for weapon-related words.

All three emotional dimensions showed a significantly greater number of ratings in the lower parts of the plots (all p values in chi-square tests < .01). This indicates that male responders generally have a happier, more aroused, and more in-control attitude toward weapons, especially firearms and the bow, for which the gender difference in ratings reached significance.
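One way to read the chi-square tests mentioned above is as a goodness-of-fit test of the number of words below versus above the no-difference line against an even split. The sketch below rests on that assumption (the exact test setup is not spelled out in the text), and the counts are hypothetical, chosen only to illustrate the arithmetic. With one degree of freedom, the p value has a closed form in terms of the complementary error function.

```python
import math

def chi_square_equal_split(n_below, n_above):
    """Goodness-of-fit chi-square (df = 1) against an even split of words."""
    expected = (n_below + n_above) / 2.0
    chi2 = ((n_below - expected) ** 2 + (n_above - expected) ** 2) / expected
    # Survival function of the chi-square distribution with df = 1
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p

# Hypothetical counts for illustration only (not taken from Fig. 10)
chi2, p = chi_square_equal_split(n_below=30, n_above=10)
print(f"chi2(1) = {chi2:.2f}, p = {p:.4f}")
```

With these made-up counts the statistic is 10 and the p value falls well below the .01 threshold used in the text.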

A similar bias toward higher valence, arousal, and dominance can be observed in male responders’ ratings of taboo words and sexual terms. As Figs. 11 and 12 demonstrate, most lexical items in this subset are located below the dotted lines, revealing overall higher ratings of taboo words by male responders (marked in blue if reaching significance) and, in rare cases, by female responders (marked in red if reaching significance). The observed discrepancies in attitudes are corroborated by Janschewitz (2008), Newman, Groom, Handelman, and Pennebaker (2008), and Petersen and Hyde (2010). The discrepancies also explain the disproportionate presence of sexual terms and taboo words among lexical items with exceedingly variable ratings (see the highlighted words in Fig. 2 whose standard deviations are larger than the value predicted from their means).

Fig. 11. Gender differences in ratings for taboo words.

Fig. 12. Gender differences in ratings for sex-related words.

General discussion

Technological advances are rapidly changing the tools that language researchers have at their disposal. Two main, complementary developments are (1) the collection of large sets of human data through crowdsourcing platforms and (2) the automatic calculation of word characteristics on the basis of relationships between words. In the former case, the current means of digital communication can be used to reach a large audience at an affordable price. The present study is a typical example of this: Instead of having to limit the list of words to a few hundred, because of a lack of human respondents, we extended the list to nearly 14,000 (see Kuperman et al., 2012, for another example of a large-sample rating obtained via crowdsourcing). Our collection of primary demographic information, such as age, gender, and education, additionally enabled refined analyses of both the central tendency and variability in each of the emotional dimensions. Likewise, it paved the way for characterization of attitudes and opinions in the population at large, as well as in specific groups of respondents.

The derivation of word features by means of counting word co-occurrences is an approach that is likely to expand considerably in the coming years. Arguably, the showcase at the moment is the derivation of word meanings by establishing which words co-occur in texts and bits of discourse. Estimates based on word co-occurrences correlate reasonably well with human-generated word associations and semantic similarity ratings. This approach was initiated by Landauer and Dumais (1997) and Burgess (1998). Recent reviews and extensions can be found in Shaoul and Westbury (2010) and Zhao, Li, and Kohonen (2011). The enterprise critically depends on algorithms that automatically extract word information from collections of texts and calculate various measures of co-occurrence.
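The basic operation underlying these methods, counting how often two words appear near each other, can be illustrated with a short sketch. This is a simplified toy version, not the specific algorithm of any of the cited models; the window size, the example sentence, and the function name are assumptions made for illustration.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look only ahead so that each pair of positions is counted once
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts

# Toy text for illustration only
tokens = "the nurse helped the doctor and the doctor thanked the nurse".split()
counts = cooccurrence_counts(tokens, window=2)
print(counts[("doctor", "the")], counts[("nurse", "the")])
```

Real systems apply such counts to very large text collections and then transform them (e.g., by weighting or dimensionality reduction) before comparing words.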

Bestgen and Vincze (2012) applied this approach to the affective dimensions of words. They calculated affective norms for over 17,000 words by comparing each word to the thousand words from the ANEW list. The score of each word was derived from the ANEW norms of the words with the closest distance in semantic space. Bestgen and Vincze observed that performance was best when the 30 closest neighbors of the target word were used. This led to correlations of r = .71 between the automatically derived values of valence and the human ratings, r = .56 for arousal, and r = .60 for dominance. All things being equal, these correlations depend on the number of so-called “seed words”—words with known values to which the new words can be compared. The more seed words, the better the estimates for the remaining words. On the other hand, the more seed words for which human data are available, the less the need for automatic extraction of such information. Our extensive data set clearly contributes to the accuracy of such computational estimates. Additionally, it introduces the opportunity to make estimates of textual sentiment for specific reader profiles: for instance, low-educated men, older women, or highly educated youngsters. This in turn may inform the creation of texts that are made more or less emotionally appealing or arousing to specific target populations.
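The neighbor-based estimation described above can be sketched as follows. This is an illustrative simplification under stated assumptions, not Bestgen and Vincze’s implementation: Words are represented by toy two-dimensional vectors, closeness is measured by cosine similarity, and the estimate is the unweighted mean of the ratings of the k closest seed words. All vectors, ratings, and function names are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def estimate_rating(target_vec, seed_vecs, seed_ratings, k=30):
    """Mean rating of the k seed words closest to the target in semantic space."""
    if not seed_vecs:
        raise ValueError("at least one seed word is required")
    ranked = sorted(seed_vecs,
                    key=lambda w: cosine(target_vec, seed_vecs[w]),
                    reverse=True)
    neighbors = ranked[:k]
    return sum(seed_ratings[w] for w in neighbors) / len(neighbors)

# Toy semantic space and made-up valence values (illustration only)
seed_vecs = {"joy": (1.0, 0.1), "smile": (0.9, 0.2),
             "grief": (0.1, 1.0), "pain": (0.2, 0.9)}
seed_ratings = {"joy": 8.0, "smile": 7.0, "grief": 2.0, "pain": 3.0}

estimate = estimate_rating((0.95, 0.15), seed_vecs, seed_ratings, k=2)
print(estimate)
```

In this toy setup the two nearest seeds are “joy” and “smile,” so the estimate is their mean. With a larger seed list, more neighbors are available close to any given target word, which is one way to see why additional seed words improve the estimates.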

To sum up, our collection of emotion norms for nearly 14,000 words gives computational and experimental researchers of language use a much wider selection of materials for their studies. Depending on the size of a person’s vocabulary, our word list is estimated to cover between one quarter and one half of the words known to an individual. Reliable ratings of the affective states invoked by this number of words will advance the study of the interplay between language and emotion.

Availability

Our ratings are available as supplementary materials for this article and are provided in .csv format. Every value is reported three times: once for each dimension, prefixed with V for valence, A for arousal, and D for dominance. For each word, we report the overall mean (Mean.Sum), standard deviation (SD.Sum), and number of contributing ratings (Rat.Sum). We also report these values for group differences, replacing the suffix .Sum with the following suffixes: .M = male; .F = female; .O = older; .Y = younger; .H = high education; .L = low education. Words are presented in alphabetical order.
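The naming scheme above lends itself to programmatic access. The sketch below assumes that the prefix, statistic, and suffix are joined by periods (e.g., V.Mean.Sum) and that the word itself sits in a column headed Word; both should be checked against the actual file. An in-memory stand-in with placeholder values is used in place of the supplementary file.

```python
import csv
import io

DIMENSIONS = {"valence": "V", "arousal": "A", "dominance": "D"}

def column_name(dimension, statistic, group="Sum"):
    """Build a column name such as 'V.Mean.Sum' or 'A.SD.F'."""
    return f"{DIMENSIONS[dimension]}.{statistic}.{group}"

# Stand-in for the supplementary .csv file; the 'Word' header and all
# numeric values are placeholders, not real ratings
sample = io.StringIO(
    "Word,V.Mean.Sum,V.SD.Sum,V.Rat.Sum,V.Mean.M,V.Mean.F\n"
    "example,5.0,1.5,20,4.8,5.2\n"
)
rows = {row["Word"]: row for row in csv.DictReader(sample)}

entry = rows["example"]
overall = float(entry[column_name("valence", "Mean")])
gender_gap = (float(entry[column_name("valence", "Mean", "F")])
              - float(entry[column_name("valence", "Mean", "M")]))
print(overall, round(gender_gap, 2))
```

To use the real data, the `io.StringIO` object would be replaced by an opened file handle for the downloaded ratings file.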

We note that group differences (gender, education level, and age), while interesting, are actually quite limited. Taking a conservative p < .01 as our definition of a significant difference, fewer than 100 words per dimension meet this criterion (education and arousal include more, with nearly 200 words each). In terms of gender, the differences seem to occur primarily in categories related to sex, violence, and other taboo topics. When these stereotypical domains are under investigation, we do advise people to consider gender differences in the ratings. The semantic categories for other group differences were more difficult to define. In general, unless there is an already established reason to consider group differences, using the overall .Sum ratings is, we feel, completely valid.