In Plato’s work Cratylus, Socrates and his two interlocutors discuss the relationship between names and their referents. One interlocutor, Hermogenes, argues that the relationship between a name and its referent is simply arbitrary and based on social convention. The other interlocutor, Cratylus, rejects this view and argues that a name has a natural, nonconventional connection with its referent (e.g., perhaps due to certain qualities that the name shares with its referent).
This discussion continued throughout the Greco-Roman world, with many philosophers developing accounts that incorporate features of both Hermogenes’ and Cratylus’ views (see Frede and Inwood, 2005). In the seventeenth century, by contrast, John Locke took a more exclusive approach, arguing for a thoroughly Hermogenean view. In An Essay Concerning Human Understanding, Locke writes, “[W]ords being sounds, can produce in us no other simple ideas than of those very sounds; nor excite any in us, but by that voluntary connexion which is known to be between them and those simple ideas which common use has made them the signs of” (III.4.11).
Locke’s exclusively Hermogenean view is also shared by many philosophers today. Michael Rescorla, in his (2011) Stanford Encyclopedia article on convention, writes, “Nowadays, virtually all philosophers side with Hermogenes. Barring a few possible exceptions such as onomatopoeia, the association between a word and its referent is not grounded in the intrinsic nature of either the word or the referent. Rather, the association is arbitrary.”
In opposition to the exclusively Hermogenean view, a new paper by Maia Pujara, Richard Wolf, Michael Koenigs, and me provides evidence that phonemes (the human speech sounds that constitute words) have an inherent, non-arbitrary emotional quality. Moreover, our data suggest that the perceived emotional valence of certain phoneme combinations depends on a specific acoustic feature: the dynamic shifts within the phonemes’ first two frequency components.
Our study is predicated on a number of previous observations. First, during the production of human phonemes, air within the vocal tract vibrates at several different frequencies simultaneously; each of these frequencies is known as a “formant frequency.” Second, humans’ ability to auditorily differentiate between phonemes is largely mediated by the relative positions and transition patterns of the first two formant frequencies (F1 and F2). Third, numerous nonhuman animal species lower their vocal tracts (thereby lowering the frequencies of their vocalizations) in order to appear larger and more threatening to antagonists or competitors.
Thus, we predicted that strings of phonemes characterized by downward shifts in the F1/F2 formants (perhaps evolutionarily rooted in antagonistic/competitive behavior) would be associated with negative emotion, whereas strings of phonemes characterized by upward shifts in the F1/F2 formants (perhaps evolutionarily rooted in conciliatory/submissive behavior) would be associated with positive emotion.
To test this prediction, we adapted a two-alternative forced-choice test in which subjects were instructed to match strings of phonemes (comprising nonsense words) with pictures. The non-words were constructed so as to exhibit either an overall upward or downward shift in F1/F2 frequencies. The pictures were selected on the basis of eliciting positive or negative emotion. During the test, consisting of 20 experimental trials, subjects saw a pair of non-words (one with upward F1/F2 shifts, the other with downward F1/F2 shifts), and a pair of pictures (one positive, one negative). Subjects were instructed to mentally sound out each word, and then match each word to one of the two pictures. As predicted, subjects reliably paired the downward F1/F2 shift non-words with the negative images and the upward F1/F2 shift non-words with the positive images (see Figure 1 below). A detailed description of the data and methods can be found in the paper.
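As a rough illustration of how responses in a task like this could be scored (this is not the analysis code used in the study), the Python sketch below records each trial as the non-word a subject paired with the positive picture. Because each trial has only two words and two pictures, that one choice fixes the whole pairing. The names `TrialResponse`, `count_predicted`, and `chose_majority_predicted` are hypothetical, and the responses in the usage note are invented for the example.

```python
from dataclasses import dataclass
from typing import Sequence

TRIALS_PER_SUBJECT = 20  # number of experimental trials described in the text


@dataclass(frozen=True)
class TrialResponse:
    """Hypothetical record of one trial.

    word_paired_with_positive is the F1/F2 shift direction ("up" or "down")
    of the non-word the subject matched to the positive picture. The other
    non-word is then necessarily matched to the negative picture.
    """
    word_paired_with_positive: str


def is_predicted(response: TrialResponse) -> bool:
    # The prediction: upward-shift words go with positive pictures
    # (and therefore downward-shift words with negative pictures).
    return response.word_paired_with_positive == "up"


def count_predicted(responses: Sequence[TrialResponse]) -> int:
    # Number of trials on which the subject made the predicted pairing.
    return sum(is_predicted(r) for r in responses)


def chose_majority_predicted(responses: Sequence[TrialResponse]) -> bool:
    # "Majority" here means more than half of the 20 trials.
    if len(responses) != TRIALS_PER_SUBJECT:
        raise ValueError("expected exactly 20 trials per subject")
    return count_predicted(responses) > TRIALS_PER_SUBJECT / 2
```

For instance, an invented subject who made the predicted pairing on 14 trials and the opposite pairing on 6 would be counted as selecting a majority of predicted responses, whereas a subject who split 10/10 would not.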
These results suggest that (contrary to the exclusively Hermogenean view popular amongst many philosophers) certain strings of phonemes have a non-arbitrary emotional quality, and, moreover, that the emotional quality can be predicted on the basis of specific acoustic features.
Figure 1. A, B: Spectrograms illustrating the first four formants (F1-F4) of the nonsense words “bupaba” and “dugada” as obtained with Praat software (version 5.1.20). When distinguishing between these words, the most salient formant transitions are the F2 transitions from consonants to vowels (outlined in red), which move slightly upward for “bupaba” and downward for “dugada”. C: Example of a trial from the visual task. Subjects pressed a vertical arrow button to match the non-words/pictures vertically, or a horizontal arrow button to match the non-words/pictures horizontally. D: Visual task data (n = 32 adult subjects, 15 males, mean age 35.5 ± 15.2). The proportion of individuals selecting a majority of predicted responses (i.e., on more than 10 out of the 20 trials) was significantly greater than expected by chance (Yates’ χ² = 23.8; p = 0.000001).
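For readers unfamiliar with the statistic in panel D, the following Python sketch shows one common form of a Yates-corrected chi-square test: a one-sample comparison of a majority/non-majority split against a 50/50 chance split, with one degree of freedom. That formulation is an assumption on my part for illustration. The paper describes the analysis actually used, and the function name `yates_chi_square_vs_chance` and the counts in the usage note are hypothetical, not the study's data.

```python
import math


def yates_chi_square_vs_chance(n_majority: int, n_total: int) -> tuple[float, float]:
    """Yates-corrected chi-square for a two-category split against 50/50.

    Assumed setup (illustrative only): n_majority of n_total subjects fall in
    the "majority predicted" category, and chance predicts an even split.
    Returns (chi_square, p_value) for one degree of freedom.
    """
    if not 0 <= n_majority <= n_total or n_total == 0:
        raise ValueError("counts must satisfy 0 <= n_majority <= n_total, n_total > 0")

    expected = n_total / 2
    observed = (n_majority, n_total - n_majority)

    # Yates' continuity correction: shrink each |O - E| by 0.5 (not below 0).
    chi2 = sum(
        max(abs(o - expected) - 0.5, 0.0) ** 2 / expected
        for o in observed
    )

    # For 1 degree of freedom, the chi-square survival function is
    # erfc(sqrt(x / 2)), so no statistics library is needed.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

With invented counts of 24 out of 32 subjects in the majority category, the corrected deviation in each cell is 7.5, giving a statistic of 2 × 7.5² / 16 = 7.03125. An even 16/16 split gives a statistic of 0 and a p-value of 1.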