GENERAL AUDITORY MODEL OF ADAPTIVE PERCEPTION OF SPEECH

by

Antonia David Vitela

A Dissertation Submitted to the Faculty of the

DEPARTMENT OF SPEECH, LANGUAGE, AND HEARING SCIENCES

In Partial Fulfillment of the Requirements For the Degree of

DOCTOR OF PHILOSOPHY

In the Graduate College

THE UNIVERSITY OF ARIZONA

2012

THE UNIVERSITY OF ARIZONA GRADUATE COLLEGE

As members of the Dissertation Committee, we certify that we have read the dissertation prepared by Antonia David Vitela entitled General Auditory Model of Adaptive Perception of Speech and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy.

Andrew Lotto (Date: 11/27/12)
Brad Story (Date: 11/27/12)
Kate Bunton (Date: 11/27/12)
Stephen Wilson (Date: 11/27/12)
Natasha Warner (Date: 11/27/12)

Final approval and acceptance of this dissertation is contingent upon the candidate’s submission of the final copies of the dissertation to the Graduate College.

I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement.

Dissertation Director: Andrew Lotto (Date: 11/27/12)

STATEMENT BY AUTHOR

This dissertation has been submitted in partial fulfillment of requirements for an advanced degree at the University of Arizona and is deposited in the University Library to be made available to borrowers under rules of the Library.

Brief quotations from this dissertation are allowable without special permission, provided that accurate acknowledgment of source is made. Requests for permission for extended quotation from or reproduction of this manuscript in whole or in part may be granted by the head of the major department or the Dean of the Graduate College when in his or her judgment the proposed use of the material is in the interests of scholarship. In all other instances, however, permission must be obtained from the author.

SIGNED: Antonia David Vitela

ACKNOWLEDGEMENTS

I would like to express my sincere appreciation to the following individuals who greatly contributed to the development and completion of this project:

To Andrew Lotto, my advisor, for making the world of academia seem magical and inspiring, for teaching me the importance of a good story, and for the constant support and advice over the past six years.

To my committee members, for all of their guidance, encouragement, insights, criticisms, and help along the way.

To the ACNE Lab, for all of the support, the experiences, and the friendships made.

To the Speech, Language, and Hearing Sciences Department for being my academic home for the past 10 years.

To my family and Jay, for all of their love and support.

This work was supported by the National Institutes of Health: F31 DC011698.

DEDICATION

To Ladefoged and Broadbent (1957)

“Please say what this word is”

For paving the way and inspiring so many researchers, including me.
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ABSTRACT
CHAPTER 1 INTRODUCTION
1.1 Acoustic Variability in the Speech Signal
1.1.1 Contextual Influences on Phonemic Representations
1.1.2 Within-Talker Variability
1.1.3 Between-Talker Variability
1.2 Traditional Theories of Talker Normalization
1.2.1 Intrinsic Theories
1.2.2 Extrinsic Theories
1.2.3 Other Approaches
1.3 Long-term Average Spectrum (LTAS) Model
CHAPTER 2 GENERAL AUDITORY MODEL
2.1 Limitations of Previous Stimuli
2.2 TubeTalker
2.2.1 TubeTalker Description
2.2.2 Development of the TubeTalker
2.2.3 Creating Different Talkers
2.3 General Auditory Model
CHAPTER 3 GENERAL METHODOLOGY AND PRELIMINARY STUDIES
3.1 Predictions
3.2 General Methodology
3.2.1 Participants
3.2.2 Stimuli
3.2.3 Procedure
3.2.4 Predictions
3.2.4.1 Deriving the LTAS
3.2.5 Analysis Methods
3.3 Prediction #1: Shifts in perception of a target phoneme can be elicited when the LTAS of preceding contexts predict opposite effects on the target
3.3.1 Preliminary Experiment 1
3.3.2 Stimuli and Predictions
3.3.3 Participants and Procedure
3.3.4 Results and Discussion
3.4 Prediction #2: Only those talker differences resulting in a change in LTAS that corresponds to a differential contrast of an important phonetic cue will result in a shift in perception
3.4.1 Preliminary Experiment 2
3.4.2 Stimuli and Predictions
3.4.3 Participants and Procedure
3.4.4 Results and Discussion
3.5 Prediction #3: Backwards stimuli should cause the same effect as the same stimuli played forwards
3.5.1 Preliminary Experiment 3
3.5.2 Stimuli and Predictions
3.5.3 Participants and Procedure
3.5.4 Results and Discussion
3.6 Prediction #4: The LTAS that is calculated is an extraction of the neutral vowel; therefore, a talker’s neutral vowel should have the same effect on perception as a phrase produced by that talker
3.6.1 Preliminary Experiment 4
3.6.2 Stimuli and Predictions
3.6.3 Participants and Procedure
3.6.4 Results and Discussion
3.7 General Discussion
CHAPTER 4 THE TEMPORAL WINDOW
4.1 Temporal Window
4.2 Experiment 1
4.2.1 Methods
4.2.1.1 Stimuli and Predictions
4.2.1.2 Participants and Procedure
4.2.2 Results
4.2.3 Discussion
4.3 Experiment 2
4.3.1 Methods
4.3.1.1 Stimuli and Predictions
4.3.1.2 Participants and Procedure
4.3.2 Results
4.3.3 Discussion
4.4 Experiment 3
4.4.1 Methods
4.4.1.1 Stimuli and Predictions
4.4.1.2 Participants and Procedure
4.4.2 Results
4.4.3 Discussion
4.5 General Discussion
CHAPTER 5 EXTRACTING THE NEUTRAL VOWEL
5.1 Extracting the Neutral Vocal Tract
5.2 Experiment 4
5.2.1 Methods
5.2.1.1 Signals and Predictions
5.2.1.2 Procedure
5.2.2 Results
5.2.3 Discussion
5.3 Experiment 5
5.3.1 Methods
5.3.1.1 Signals and Predictions
5.3.1.2 Procedure
5.3.2 Results
5.3.3 Discussion
5.4 Experiment 6
5.4.1 Methods
5.4.1.1 Stimuli and Predictions
5.4.1.2 Participants and Procedure
5.4.2 Results
5.4.3 Discussion
5.5 General Discussion
CHAPTER 6 HARMONIC COMPLEX EFFECTS
6.1 Harmonic Complex Effects
6.2 Experiment 7
6.2.1 Methods
6.2.1.1 Stimuli and Predictions
6.2.1.2 Participants and Procedure
6.2.2 Results
6.2.3 Discussion
6.3 Experiment 8
6.3.1 Methods
6.3.1.1 Stimuli and Predictions
6.3.1.2 Participants and Procedure
6.3.2 Results
6.3.3 Discussion
6.4 Experiment 9
6.4.1 Methods
6.4.1.1 Stimuli and Predictions
6.4.1.2 Participants and Procedure
6.4.2 Results
6.4.3 Discussion
6.5 General Discussion
CHAPTER 7 DISCUSSION AND CONCLUSIONS
7.1 Overview Summary
7.2 Specific Aim #1: Determine the temporal window over which the LTAS is calculated
7.2.1 Interpretation and Future Research
7.3 Specific Aim #2: Determine whether or not the LTAS reflects the neutral vocal tract
7.3.1 Interpretation and Future Research
7.4 Determine what aspects of the comparisons of the LTAS and the target predict shifts in target categorization
7.4.1 Interpretation and Research
APPENDIX A. AMBIGUOUS TARGET SELECTION
APPENDIX B. FILTERED EXPERIMENTS
REFERENCES

LIST OF FIGURES

Figure 1 Spectrograms of productions of /bub/, /dud/, /bɅb/, and /dɅd/
Figure 2 F1xF2 vowel space for three synthesized talkers
Figure 3 A pseudo-midsagittal view of the neutral vocal tract shapes for the five “talkers” created from the TubeTalker
Figure 4 F1xF2 vowel space adjusted for the neutral vowel
Figure 5 Spectrograms of the endpoints of the /dɑ/-/gɑ/ target series
Figure 6 F1xF2 vowel space of the vowel portion of the 12-step “bit” to “bet” target series
Figure 7 An example of the computer screen participants see during experimentation
Figure 8 LTAS comparison between the carrier phrases “He had a rabbit and a” of the LongVT and ShortVT and the ambiguous target from the “bit” to “bet” series
Figure 9 LTAS comparison between the carrier phrases “Abracadabra” of the LongVT and ShortVT and the ambiguous target from the “da” to “ga” series
Figure 10 Results for Preliminary Experiment 1
Figure 11 LTAS comparison between the carrier phrases “He had a rabbit and a” of the O+P-VT and O-P+VT and the ambiguous target from the “bit” to “bet” series
Figure 12 Results for Preliminary Experiment 2 (LongVT vs. ShortVT)
Figure 13 Results for Preliminary Experiment 2 (O-P+ vs. O+P-)
Figure 14 Results for Preliminary Experiment 3 (Forwards)
Figure 15 Results for Preliminary Experiment 3 (Backwards)
Figure 16 LTAS comparison between the carrier phrases “Please say what this word is” and the neutral vowel of only the ShortVT, along with the ambiguous target from the “da” to “ga” series
Figure 17 Results for Preliminary Experiment 4 (Forwards)
Figure 18 Results for Preliminary Experiment 4 (Backwards)
Figure 19 Waveform and spectrogram of the LongVT’s production of “He had a rabbit and a” for the stimuli used in Experiment 1
Figure 20 LTAS comparisons for the stimuli in Experiment 1 and Experiment 2
Figure 21 Results for Experiment 1
Figure 22 Waveform and spectrogram of the LongVT’s production of “He had a rabbit and a” for the stimuli used in Experiment 2
Figure 23 Results for Experiment 2
Figure 24 LTAS comparisons for the stimuli in Experiment 3
Figure 25 Results for Experiment 3
Figure 26 LPC comparisons between Female1’s neutral vowel and different sentence segments
Figure 27 LPC comparisons between Female2’s neutral vowel and different sentence segments
Figure 28 LPC comparisons between Male1’s neutral vowel and different sentence segments
Figure 29 LPC comparisons between Male2’s neutral vowel and different sentence segments
Figure 30 The vowel trajectory synthesized from the LongVT
Figure 31 LPC comparisons of the neutral vowel and the vowel trajectory segments synthesized from the LongVT
Figure 32 LPC comparisons of the neutral vowel and the vowel trajectory segments synthesized from the ShortVT
Figure 33 LPC comparisons of the neutral vowel of one TubeTalker compared to the full vowel trajectory of the other TubeTalker
Figure 34 LPC comparisons of the ShortVT, LongVT, and ambiguous target with both the synthesized neutral vowel and the vowel trajectory
Figure 35 Results for Experiment 5
Figure 36 The spectra of three of the harmonic complexes and the ambiguous target from the “bit” to “bet” series
Figure 37 Results for Experiment 7
Figure 38 Experiment 7 individual differences for each harmonic complex
Figure 39 Spectra of the ambiguous target and the harmonic complex with a high-amplitude peak at 700 Hz in both the matched amplitude condition and the high amplitude condition
Figure 40 Results for Experiment 8
Figure 41 Experiment 8 individual differences for the matched peak condition
Figure 42 Experiment 8 individual differences for the high peak condition
Figure 43 Spectra of the ambiguous target and the harmonic complex with a trough at 700 Hz in both the small trough and the large trough conditions
Figure 44 Results for Experiment 9
Figure 45 Experiment 9 individual differences for the small trough condition
Figure 46 Experiment 9 individual differences for the large trough condition

LIST OF TABLES

Table 1 A summary of the stimuli used in each of the preliminary studies
Table 2 A list of the sentences recorded from the four participants and the order in which each of the sentences was appended for each talker to make the stimuli
Table 3 The duration of each stimulus used in the LTAS comparisons with the neutral vowel from each talker
Table 4 Results of Experiment 4
Table 5 Results of Experiment 5

ABSTRACT

One of the fundamental challenges for communication by speech is the variability in speech production/acoustics. Talkers vary in the size and shape of their vocal tract, in dialect, and in speaking mannerisms. These differences all impact the acoustic output. Despite this lack of invariance in the acoustic signal, listeners can correctly perceive the speech of many different talkers. This ability to adapt one’s perception to the particular acoustic structure of a talker has been investigated for over fifty years. The prevailing explanation for this phenomenon is that listeners construct talker-specific representations that can serve as referents for subsequent speech sounds.
Specifically, it is thought that listeners may either be creating mappings between acoustics and phonemes or extracting the vocal tract anatomy and shape for each individual talker. This research focuses on an alternative explanation. A separate line of work has demonstrated that much of the variance between talkers’ productions can be captured in their neutral vocal tract shape (that is, the average shape of their vocal tract across multiple vowel productions). The current model tested is that listeners compute an average spectrum (long-term average spectrum – LTAS) of a talker’s speech and use it as a referent. If this LTAS resembles the acoustic output of the neutral vocal tract shape – the neutral vowel – then it could accommodate some of the talker-based variability. The LTAS model results in four main hypotheses: 1) during carrier phrases, listeners compute an LTAS for the talker; 2) this LTAS resembles the spectrum of the neutral vowel; 3) listeners represent subsequent targets relative to this LTAS referent; 4) such a representation reduces talker-specific acoustic variability. The goal of this project was to further develop and test the predictions arising from these hypotheses. Results suggest that the LTAS model needs to be further investigated, as the simple model proposed does not explain the effects found across all studies.

CHAPTER 1 INTRODUCTION

1.1 Acoustic Variability in the Speech Signal

In the early 1940s, two researchers (Al Liberman and Frank Cooper) set out to create a reading machine for the blind. At the time, they assumed that speech was simply a learned one-to-one mapping from sound to consonant/vowel identity and, therefore, that they could create their own map for a listener to learn by ascribing a distinctive sound to each consonant/vowel. All they would need was a machine that could scan the letters of a text and, for each individual letter, produce its analogous new sound.
Then, they would train the blind reader to know which sounds corresponded to which letters and voilà! The blind person would be able to listen to any written text of their choosing. While they had felt this task was achievable and straightforward, they quickly discovered that the perception of speech sounds is a far more complex process than they had believed. They were unable to find sounds that listeners could learn and understand at a rate comparable to that of true readers (their best participants plateaued at a fifth-grade “reading” level), so they began to look to real speech for answers. Specifically, they set out to discover the acoustic patterns and cues that identify speech sounds (phonemes), hoping this information could help in the creation of more listener-friendly sounds for the reading machine. Their search spawned an entire field, because they never found these distinctive acoustic patterns. It turned out that there was a great amount of acoustic variability in the speech signal. There were no specific patterns that identified each phoneme; in fact, different acoustic patterns could lead to the same phonemic percept, and the same acoustic pattern could lead to different phonemic percepts, depending on the surrounding phonemic context. This problem has been referred to as the “lack of invariance” and, while some researchers continue to develop theories about invariant cues (Blumstein & Stevens, 1979; 1980; 1981), others believe that listeners must somehow be compensating for the variance via additional processing or special mechanisms (these will be discussed in Section 1.2). (See Liberman (1996) for a full account of the reading machine and the search for invariant cues in speech perception.) The acoustic variability inherent in the realizations of phonemes comes from at least three major sources.
The first is the surrounding context of any particular phoneme (how the phonemes that precede and follow a target sound can affect the perception of that sound). The second is within-talker changes in speaking style (how clear vs. conversational speech can change the speech signal of the same talker). And the last is between-talker variability (e.g., how specific differences in anatomy, physiology, and accents between talkers affect the acoustic output). Each of these sources of variability influences the speech signal in parallel, but each has been studied separately and so will be discussed separately below.

1.1.1 Contextual Influences on Phonemic Representations

For this section, let us pretend that there is but one speech apparatus that is exactly the same (in size, shape, source, articulatory patterns, etc.) and that, even across genders, every human has this one structure from which to produce speech. From productions of this one vocal tract, we hope to analyze and find the specific speech patterns or cues for every phoneme. Recordings of consonant-vowel-consonant (CVC) combinations are made, and the spectrograms of each production are perused for invariant speech patterns. The most common aspect of the signal to focus on is the formants (resonances of the vocal tract shape). These peaks or areas of energy are the most salient in a spectrum or spectrogram and are apparent in the neural encoding of steady-state vowels (Sachs & Young, 1979), so the patterns and transitions that they create may be the auditory system’s key to phonemic categorization. Figure 1 shows spectrograms for two pairs of words (each pair shares the same vowel): /bub/, /dud/, /bɅb/, and /dɅd/. They were recorded from a male speaker. The formants are tracked by the dots, with the first two labeled. Perceptually, the vowel in each pair of words would be identified as the same (either /u/ or /Ʌ/), but note the differences in formant frequency.
For /bub/, the vowel’s F1 is at 341 Hz and F2 is at 985 Hz, but for /dud/, the vowel’s F1 is at 265 Hz and F2 is at 1312 Hz. The second formant is much higher in the d-vowel-d context than in the b-vowel-b context, while the first formant is lower. Yet, again, both of these vowels are perceived as /u/. For /bɅb/, the vowel’s F1 is at 684 Hz and the F2 is at 1213 Hz. For /dɅd/, the vowel’s F1 is at 616 Hz, and the F2 is at 1414 Hz. Again, the second formant is higher in the dVd context and the first formant is lower. (All formant measurements were taken at the steady-state portion of the vowel: /bub/ = .126 sec, /dud/ = .164 sec, /bɅb/ = .095 sec, and /dɅd/ = .097 sec.) Figure 1. Spectrograms of productions of /bub/, /dud/, /bɅb/, and /dɅd/. The dots track the location of the formants. The first two formants are labeled. Lindblom (1963) attributed these formant differences to “undershoot.” Essentially, talkers do not reach their articulatory goal when speaking at a certain rate. This “undershoot” of the articulatory goal leads to differences in the formants, because the articulatory target for a certain vowel may be closer or further away in certain consonantal contexts. That is, if a consonant’s and vowel’s articulatory targets are close, the vowel’s formants will be closer to their characteristic frequencies than if the two targets are further away from one another. This leads to variation in the frequencies of each formant. In any case, the surrounding context of a vowel directly affects the formant patterns so that the formant frequencies of a vowel will not be the same across different phonemic contexts. This means that there is no fixed formant pattern that identifies each individual vowel sound. Listeners cannot simply use a template approach to match a specific formant pattern to a specific vowel, since this would lead to misidentifications, depending on the consonant context that surrounds the vowel.
Context-specific changes in acoustics have been studied over a wide range of different phonemes. Hillenbrand, Clark, and Nearey (2001) noted significant changes in formant patterns for eight different vowels (/i, ɪ, ɛ, æ, ɑ, ʊ, u, ʌ/) across 42 possible initial (/h, b, d, g, p, t, k/) and final (/b, d, g, p, t, k/) consonant combinations when the patterns were compared to the isolated vowels’ formant patterns. The most substantial shift was found when a rounded vowel was preceded by an alveolar (e.g., “dude”). These effects have also been shown with consonants (Delattre, Liberman, & Cooper, 1955; Malecot, 1956; Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Blumstein & Stevens, 1979) and in other languages (Gottfried, 1984; Recasens, 1985; Esposito, 2002; Strange et al., 2007; Chen, 2008). With this amount of variation in the speech signal, it does not seem that listeners could simply learn patterns for each context, since there do not appear to be clear patterns that identify each phoneme. The task of the listener would seem quite daunting. Somehow, a listener must deal with the variability in the speech signal to correctly perceive speech or to simply identify vowel sounds. Lindblom and Studdert-Kennedy (1967) examined whether listeners’ perceptions of vowels were context-sensitive, as would seem necessary given the context-specific nature of the acoustics. First, they synthesized a 20-step /ɪ/ to /ʊ/ vowel series by manipulating the second and third formants in equal steps. Then, they created two CVC (consonant-vowel-consonant) series (or, rather, glide-vowel-glide series) with /jVj/ and /wVw/ providing the surrounding context. These particular glides were chosen to ensure that the formant transitions leading to the vowel portion of the stimulus moved in opposite directions. The second and third formant transitions for /j/ started high and then sloped downward to the vowel portion, whereas the formants for /w/ started low and sloped upward.
A two-alternative forced-choice task was utilized, in which listeners were asked to identify the vowel as either the /ɪ/ in “bit” or as the /ʊ/ in “book”, with presentations in isolation or in one of the CVC contexts. The authors posited two potential outcomes: (1) Listeners’ responses would be the same for stimuli that had identical formant positions at the midpoints (the most steady-state portion of the vowel); or (2) Listeners’ responses would vary as a function of the context. The results showed the latter. Relative to the phonemic boundary of the vowels in isolation, more /ɪ/ responses were made in the /wVw/ context and more /ʊ/ responses were made in the /jVj/ context. Thus, the study demonstrated that vowel perception was context-sensitive. This effect was replicated in a series of studies by Williams (1986). Nearey (1989) followed up on this line of work to see if the effect would extend to other contexts, since glide-vowel-glide syllables are not common in English. Using the paradigm of Lindblom and Studdert-Kennedy (1967), Nearey created an isolated vowel series moving from /ɒ/ to /ʌ/ to /ɛ/ (vowels of Western Canadian English) and then used these vowels to make two different CVC series, /bVb/ and /dVd/. His findings provide further evidence that the context can elicit a perceptual shift in vowel categorization. When comparing isolation to the /dVd/ context, the boundary was shifted towards more /ɒ/ responses for the /ɒ/ to /ʌ/ series and towards more /ʌ/ responses for the /ʌ/ to /ɛ/ series. Both boundaries shifted in the opposite direction for the /bVb/ context (but the shift was only significant for the /ɒ/ to /ʌ/ series). The /ʌ/ to /ɛ/ effect was also replicated by Holt, Lotto, and Kluender (2000). In summary, analyses of speech CVCs have revealed that formant frequencies for vowels are affected by the surrounding phonemes.
Perceptually, different formant frequencies can elicit the same vowel identification and the same formant frequencies can elicit different vowel identifications based on the context that precedes and follows the target. This indicates that even if every person had the same vocal tract, the lack-of-invariance problem would still be present. Most surprising is that listeners are capable of compensating for all of this variability in order to correctly perceive the intended message of the speaker. As impressive as this is, context-sensitive variability is only one source of acoustic variability. 1.1.2 Within-Talker Variability In the previous section, evidence was provided to show that even if every human had the same vocal tract, there would still be variability in the speech signal. Even that idealization understates the problem, because the same speaker does not always choose to speak in the same way. That is, there are different speaking styles that a talker can choose to use depending on the context of the situation. One type of speaking style is known as clear speech (or “read” or “laboratory” speech). Typical studies that have looked at formant values and the possible defining cues for different vowels run these analyses on clear speech. Typical speakers, however, do not use this type of speech in regular, everyday conversations but use “conversational” or “reduced” speech. In fact, Lindblom (1990) suggests that changes in speaking style should actually be considered a continuum from hypo-articulated to hyper-articulated speech (H&H Theory). The speaker, essentially, chooses where they fall along this continuum based on the listener. If the topic of conversation is one that the listener is familiar with, the speaker will choose to speak in a more relaxed manner, falling on the hypo or reduced side of the continuum.
If the topic is not as well known to the listener, the speaker will choose to speak more clearly, falling on the hyper or clear side of the continuum. Even though there are two extremes to this continuum, the speaker could choose to speak at any place in between and could also vary their style during the conversation as the topic changes, or even between old and new information in the same sentence. It is a balance between providing a clear message and minimizing articulatory effort. Several studies have been carried out to characterize the acoustic changes that accompany shifts in speaking style. In particular, there has been an in-depth description of the results of a talker trying to produce “clear” intelligible speech (e.g., Picheny, Durlach & Braida, 1986; Moon & Lindblom, 1994; Bradlow, Kraus & Hayes, 2003; Lotto, Ide-Helvie, McCleary & Higgins, 2006). The consequences of clear speech production include increased segment durations, greater f0 ranges, and shifts in vowel formant frequencies resulting in an expanded vowel space (Chen, 1980; Picheny et al., 1986; Moon & Lindblom, 1994; Ferguson & Kewley-Port, 2002; Bradlow et al., 2003). It should be noted that in most of these studies the comparisons are between very “clear” speech, such as one may produce when speaking to an interlocutor who does not share the native language, and the relatively clear speech that is normally produced in laboratory recordings. That is, much of the research has been conducted on the “Hyper” end of Lindblom’s H&H continuum. Recently, there has been growing interest in the hypo-articulated, reduced speech that is probably more characteristic of typical conversational speech (e.g., Greenberg, 1999; Ernestus, 2000; Johnson, 2004; Warner, Fountain & Tucker, 2009).
This speech is typified by alterations and/or deletions of segments, or productions that are not articulated in their canonical form (Johnson, 2004; Pluymaekers, Ernestus, & Baayen, 2005; Warner, 2005; Dilley & Pitt, 2007; Nakamura, Iwano & Furui, 2007). Remarkably, many of the cues that researchers have focused on to solve the problem of the “lack of invariance” may not even be present in the speech signal during typical conversation! Listeners’ ability to correctly perceive speech is quite amazing, but even this is a simplification of the problem. There are still differences between talkers that need to be taken into account. 1.1.3 Between-Talker Variability If predicting the acoustic output of a specific speaker did not seem complex enough, between-talker variability must be considered. Obviously, every human does not have the same vocal tract. Every human has a distinct vocal tract shape with inherent differences, ranging from speaker-specific idiosyncrasies to more general differences due to gender or stage of development. On top of that, talkers’ speech characteristics are a product of their geographical location and socioeconomic status, each with its own accent or dialectal variation. Essentially, everyone’s productions are a unique amalgamation of their physiology and background. This “uniqueness” was demonstrated in a classic study by Peterson and Barney (1952). They recruited 33 men, 28 women, and 15 children from the Mid-Atlantic region of the United States and recorded them producing ten vowels in an /hVd/ context. For each group, they analyzed the first three formants and f0 and plotted each vowel in an F1xF2 space. The measurements showed a great amount of variability across and within each group, and the vowel space plots showed large overlap across different vowel categories. There were some trends, though.
For instance, the f0 and formant frequencies of male productions were the lowest and the children’s productions were the highest, with female productions falling in between. It may seem obvious that there would be systematic differences in formant frequency measurements across the three groups, as a direct result of the length of the vocal tract. For example, Figure 2 shows the F1xF2 vowel space for three different synthesized talkers who vary only in vocal tract length, while all other parameters (e.g., source, basic vocal tract shape, articulation pattern) are constant. These talkers were created using a vocal tract model that is based on measurements of vocal tract area shapes (the model is described in Chapter 2; Story, 2005a). The standard vocal tract is representative of a typical male speaker, with a vocal tract length of 17.5 cm from glottis to lips. The short vocal tract is 20% shorter (14 cm) and the long vocal tract is 20% longer (21 cm). Figure 2 shows that, even in this constrained case, there is a great deal of variability and overlap across and within the vowel categories for these three “talkers.” Figure 2. F1xF2 vowel space for three synthesized talkers who vary only in vocal tract length (Standard = 17.5 cm, Long = 21 cm, and Short = 14 cm). This problem of vocal tract variation becomes even more complex when gender and age are allowed to vary. Children’s vocal tracts are not just shorter than adults’ but are configured differently. The almost 90-degree bend common to adult vocal tracts develops over time (infants’ vocal tracts are initially structured in a way that allows them to feed and breathe at the same time). This and other structures of the vocal tract mature at different rates, so that children are learning to manipulate an ever-changing vocal apparatus. Then, at puberty, children’s vocal tracts change more drastically depending on their gender.
In particular, the pharyngeal region of a male vocal tract grows significantly longer than that of a female (Kahane, 1982; Fitch & Giedd, 1999; Vorperian, Kent, Lindstrom, Kalina, Gentry, & Yandell, 2005; Vorperian, Wang, Chung, Schimek, Durtschi, Kent, Ziegert, & Gentry, 2009). There is also evidence that younger and older adults differ in vocal tract size and basic shape in terms of the ratio of their oral to pharyngeal cavities (Kahane, 1980; Xue & Hao, 2003). Thus, not only are there inherent differences in speakers’ anatomy, but the anatomy is constantly changing over the lifespan. Moreover, there are regional dialects and accents that vary across and within nationalities. With regard to North-American English, entire books have been devoted to regional variability in the acoustics of speech production (Krapp, 1925; Kurath, 1939; Frazer, 1993; Thomas, 2001; Labov, Ash, & Boberg, 2006). Vowel spaces based on F1 and F2 frequencies indicate that there is a great deal of variability across and even within dialects (Clopper, Pisoni, & de Jong, 2005; Jacewicz, Fox, & Salmons, 2007; Neel, 2008). Complicating the issue more is that these dialects and accents are not stable but change over time. For instance, vowel shifts are a type of historical sound change that occurs when speakers of a specific geographical region shift or change the way in which they produce vowels. One example is the California vowel shift. Compared to vowel recordings taken in the 1950s, the back vowels /u/, /ʊ/, and /o/ are becoming more fronted and less rounded when produced by younger adults (Hinton, Moonwomon, Bremner, Luthin, Van Clay, Lerner, & Corcoran, 1987). In short, there is a large amount of variation across talkers that affects the acoustic output of any single talker. Yet, listeners are able to correctly perceive the intended message of the speaker.
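The first-order effect of vocal tract length on formant frequencies, which underlies much of the anatomical variability just described, can be sketched with a simple idealization: a uniform tube closed at the glottis and open at the lips resonates at odd multiples of c/4L, so every formant scales inversely with length. This is only a rough approximation offered for illustration (the Story (2005a) model behind Figure 2 is far more detailed), and the speed-of-sound value is an assumed round number.

```python
# Illustrative quarter-wave-tube sketch, NOT the Story (2005a) model:
# a uniform tube closed at one end resonates at (2n-1) * c / (4L),
# so all formants of a neutral tract scale inversely with its length.

SPEED_OF_SOUND = 35000.0  # cm/s, an assumed approximate value in warm air

def neutral_formants(length_cm, n_formants=3):
    """Formant frequencies (Hz) of a uniform quarter-wave tube."""
    return [(2 * n - 1) * SPEED_OF_SOUND / (4.0 * length_cm)
            for n in range(1, n_formants + 1)]

# The three tract lengths used for the synthesized talkers in Figure 2
for label, length in [("Short", 14.0), ("Standard", 17.5), ("Long", 21.0)]:
    f1, f2, f3 = neutral_formants(length)
    print(f"{label:8s} ({length:4.1f} cm): F1={f1:6.0f} F2={f2:6.0f} F3={f3:6.0f}")
```

Under this idealization the 17.5 cm "Standard" tract has formants at 500, 1500, and 2500 Hz, and shortening the tube by 20% raises every formant by 25%; real vowel spaces, as Figure 2 shows, are shifted less uniformly than this.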
Despite all of the variability caused by the surrounding phonemic context, by differences in the anatomy of the vocal tract, by accents/dialects, and by speaking styles, listeners are somehow able to compensate for the differences between talkers. The process by which this happens is known as “talker normalization.” Listeners are able to “normalize” or adjust their perception and tune to the particularities of an individual speaker so that they can accurately identify phonemes across a wide range of talkers. The following section lays out the predominant theories on how this process works. 1.2 Traditional Theories of Talker Normalization Two main mechanisms have been proposed to explain the process of talker normalization: intrinsic and extrinsic. These two differ in what they consider the “source” of the information for accurate vowel perception. That is, the theories do not agree on which part of the speech signal holds the necessary acoustic information that listeners are using to understand different talkers. Intrinsic theories assume that the source is within the segment of the acoustic signal for that phoneme – that there is a one-to-one mapping from properties of the signal to phoneme perception, even if these properties are difficult to find. The other group of theories is known as extrinsic theories. These maintain that while a speaker is talking, the listener is learning information specific to that speaker. This learned knowledge is the source and helps the listener understand subsequent productions from that speaker. The research project described in this dissertation concerns information available to the listener that is extrinsic to the target vowel. (The terms intrinsic and extrinsic were first coined by Ainsworth (1975).) 1.2.1 Intrinsic Theories Intrinsic theories of talker normalization assume that the important cues for identifying specific vowel sounds lie within the portion of the signal that corresponds to the phoneme of interest.
These cues are broadly defined and can be any acoustic information that lies within the vowel portion (including transitions). In most theories, the cues are a mathematical transformation of the values of a combination of the formants and sometimes the fundamental frequency. However, the raw frequency values are unlikely candidates for perception, because the encoding of these frequencies on the basilar membrane is non-linear (Rhode, 1971; Robles, Rhode, & Geisler, 1976). It is tempting to suggest, then, that these inherent non-linearities in the auditory system may actually provide some “normalization” across talkers, since they modify the incoming acoustic signal in a specific way. So instead of using raw values, researchers tend to use psychophysical scales (those based on experiments that have tested listeners’ pitch perception). In these studies, listeners are asked to judge whether two tones have the same or different pitches, and it has been found that the higher the frequency of the tones, the more difference there needs to be between them for listeners to notice a difference. There are numerous scales that take this into account (e.g., Koenig, mel, Bark, Equivalent Rectangular Bandwidth). These scales are seen to be more representative of how frequencies are encoded, and so most intrinsic theories employ them, rather than using the absolute frequency of the formant peaks. The first theory of this type was described by R.J. Lloyd in a series of manuscripts published from 1890 to 1901. Notably, he was one of the first researchers to state the importance of the first two formant frequencies for vowel identification. And, in opposition to current views at the time, Lloyd felt that the absolute value of these formants’ frequencies was not what distinguished vowels but that the relative value (or ratio) between formant peaks was essential to identifying vowels.
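The psychophysical scales mentioned above are simple functions of frequency in Hz. As a sketch, the formulas below are commonly cited published approximations (O’Shaughnessy’s mel formula, Traunmüller’s Bark approximation, and Glasberg and Moore’s ERB-rate); which variant a given study uses differs, so these specific equations are an illustrative assumption rather than the ones used in any study cited here.

```python
import math

def hz_to_mel(f):
    """Mel-scale approximation often attributed to O'Shaughnessy (1987)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def hz_to_bark(f):
    """Bark-scale approximation from Traunmueller (1990)."""
    return 26.81 * f / (1960.0 + f) - 0.53

def hz_to_erb_rate(f):
    """ERB-rate (number of ERBs below f) from Glasberg & Moore (1990)."""
    return 21.4 * math.log10(1.0 + 0.00437 * f)

# Equal 500 Hz steps in frequency are increasingly compressed on all
# three scales, mirroring the pitch-discrimination results described above.
for f in (500.0, 1000.0, 1500.0, 2000.0):
    print(f"{f:6.0f} Hz -> {hz_to_mel(f):7.1f} mel, "
          f"{hz_to_bark(f):5.2f} Bark, {hz_to_erb_rate(f):5.2f} ERB")
```

The compressive shape of each function is the point: a fixed difference in Hz corresponds to a smaller perceptual difference the higher in frequency it occurs.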
He felt that this had to be so, since there was inherent variability in the physiology of the vocal tracts across different speakers, which would invariably lead to differences in the resonant properties of their vocal tracts. The ratio between the two formants, therefore, had to be the constant cue that helped the listener to perceive the correct vowel sound (see MacMahon (2007) for a summary of R.J. Lloyd’s work). Since Lloyd’s time, other researchers have proposed similar theories based on formant ratios. Some have focused more on the perceptual side by manipulating recordings of vowel productions to observe the effect that this has on listeners’ perception. For example, Chiba and Kajiyama (1958) lowered or raised the sampling rate of vowel recordings during playback. They found that a specific stimulus would be identified as the same vowel as long as each of its formants fell within a specific frequency range. This kept the ratios relatively constant, which they claimed led to the constant perception of the listener. Other researchers have taken an ideal-observer approach – determining the acoustic cues that lead to optimal classification performance. Typically, these studies identify the frequency of the first three formants and then plot the ratio F1/F2 on one axis and the ratio F2/F3 on the other axis (which looks very similar to Figure 2) in the hopes that this will clean up the space and clearly discriminate between vowel groups. In either case, it is assumed that the ratio remains constant and so creates a specific frequency pattern. This pattern is what is important and unique for each vowel sound. Vowels are, therefore, identified via this pattern of stimulation that would be represented on the basilar membrane, even if the starting point of the pattern was displaced in frequency. However, formant ratios are not successful at categorizing all vowel groups.
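The key property that makes formant ratios attractive can be shown directly: uniform spectral scaling, of the kind produced by changing the playback rate as Chiba and Kajiyama did, shifts every formant but leaves their ratios untouched. In the sketch below, the formant values are the /u/ measurements from the /bub/ token reported earlier in this chapter, and the 20% scaling factor is an arbitrary choice for illustration.

```python
def formant_ratios(formants):
    """Ratios of adjacent formants, e.g. (F2/F1,) for a two-formant input."""
    return tuple(hi / lo for lo, hi in zip(formants, formants[1:]))

# F1 and F2 of the /u/ in /bub/ as measured earlier in this chapter
original = (341.0, 985.0)

# Raising the playback rate by 20% multiplies every resonance by 1.2,
# as if the same vowel came from a 20% shorter vocal tract
scaled = tuple(1.2 * f for f in original)

print(formant_ratios(original))  # F2/F1 for the original token
print(formant_ratios(scaled))    # same ratio, despite shifted frequencies
```

A ratio-based identifier would therefore treat the original and scaled tokens as the same vowel, which is exactly the talker-independence the theory aims for; the limitation, as the next paragraphs describe, is that real between-talker variation is not a single uniform scaling.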
For instance, Potter and Steinberg (1950) found that while the F1/F2 ratio was successful in discriminating between front vowels, it was not very successful with back vowels. Because of this, researchers have attempted to update the formant ratio theory. Peterson (1952; 1961) tried to add a third dimension that could further disambiguate the vowel space. In addition to the formant ratios, he used measurements of amplitude differences and then also tried to use measurements of the fundamental frequency. Although these added to the discrimination ability for some of the vowels, neither was completely successful with the entire vowel set used in the analysis. Katz and Assmann (2001) ran a study just on fundamental frequency, finding that its relationship with vowel identification was not as straightforward as hoped. Halberstam and Raphael (2004) came to the same conclusion after running a number of perceptual studies investigating the importance of the fundamental frequency and the third formant for vowel identification. Miller (1989) proposed a new, more complex formant ratio theory with a three-stage process that converted the acoustic signal into phonetic categories. This was done by taking into account an additional variable that he referred to as the “sensory reference.” This reference point takes into account the average pitch of all talkers, the average pitch of the current talker, the ratio between the fundamental frequency of the talker and the first formant of the vowel that is being perceived, and the fluctuations in pitch that can occur throughout a production. Miller (1989) found that by including this, there was less overlap between the back vowels that were of concern in the previous papers. Researchers have also started to look for evidence of intrinsic normalization based on knowledge of the discharge patterns of different types of neurons.
Sussman (1986) proposed a model of vowel perception based on research demonstrating that mustached bats have neurons sensitive to different frequency combinations. If humans have similar neurons that respond maximally to the relationship between different frequencies, these could provide evidence in support of formant ratios. More recently, researchers have started to record human neural responses to different frequency combinations to see if the brain is sensitive to formant ratios. Jacobsen, Schröger, and Sussman (2004) found evidence that it is, using ERPs, and Monahan and Idsardi (2010) found similar results using MEG. Both concluded that the formant ratio theories are still worth pursuing. Aside from formant ratio theories, there are also theories proposing that the differences between some combination of the formants, or between the first formant and the fundamental frequency, are the intrinsic variables used by the listener. Traunmüller (1981) has suggested that the distance between f0 and the first formant plays an important role in vowel perception, especially when judging the “openness” of a vowel. The perceptual vowel model proposed by Syrdal & Gopal (1986) also includes the difference between f0 and F1, as well as between F3 and F2. While the f0-F1 difference is said to distinguish vowel height, the F3-F2 difference distinguishes between the different places of articulation. Intrinsic theories may be able to account for some of listeners’ ability to identify vowel sounds, but they cannot account for all of it. No intrinsic theory has been able to correctly categorize all vowel productions from multiple talkers. While researchers are still pursuing this line of research, there is no question that extrinsic information plays a role in vowel perception, as well. That is, the acoustic information that precedes a vowel can influence the perception of that vowel.
For instance, it has been shown that listeners’ perception of the same sound can be affected by a change in talker of a preceding phrase. Extrinsic theories attempt to explain why this occurs. 1.2.2 Extrinsic Theories “…if the formants differ, the sounds are not alike. If, then, listeners hear [three vowels with differing F1 and F2 values] as the same vowel, it is not because of any evidence in the sound. Therefore the identification is based on outside evidence. If this outside evidence were merely the memory of what the same phoneme sounded like a little earlier in the conversation, the task of interpreting rapid speech would presumably be vastly more difficult than it is. What seems to happen, rather, is this. On first meeting a person, the listener hears a few vowel phones, and on the basis of this small but apparently sufficient evidence he swiftly constructs a fairly complete vowel pattern to serve as background (coordinate system) upon which he correctly locates new phones as fast as he hears them.” (Joos, 1948, p. 61) This notion that listeners are able to extract or learn information about a speaker that will help them identify subsequent productions from that speaker defines extrinsic approaches. Joos (1948) continues to expound upon this theory in his monograph on acoustic phonetics. He speculates that listeners must be creating a coordinate space by identifying the extreme or corner vowels of the particular talker and identifying any other vowel productions by placing them in this space. Additionally, he proposes that these acoustic parameters are transmitting articulatory information, rather than purely acoustic information. This idea that the listener is learning about the speaker’s vocal apparatus and articulatory patterns is still debated today. On one side, researchers assume that the acoustic information for a particular speaker is all that is necessary for talker normalization.
On the other side, some theorists suggest that the listener extracts information specific to the articulations that the speaker uses to produce each speech sound. As a result, there are two classes of extrinsic theories – those that claim the listener learns about the acoustics of a speaker and those that claim the listener is learning about the articulations or anatomy of the speaker. In the former case, the listener is extracting an acoustic reference on which other productions will be based. In the latter, the listener is obtaining information about the vocal tract of the specific talker. Ladefoged and Broadbent (1957) set up one of the first studies to test the theory Joos (1948) proposed. They created six different talkers by manipulating the first and second formants of the synthesized phrase “Please say what this word is.” Each carrier phrase was played before one of four test or target words, each of which had one of four different vowel sounds in a /bVt/ context. Listeners were then asked to identify the target word as “bit”, “bet”, “bat”, or “but.” The hypothesis was that if listeners did map out a vowel space or create a coordinate system based on their knowledge of the speaker, their identification of the target word would shift for different talkers (i.e., different extrapolated vowel spaces). That is exactly what the authors found. For example, a significant effect was found between the carrier phrase that had both a raised F1 and F2 and the carrier phrase where the F1 and the F2 were both lowered. After hearing the carrier phrase with the raised formants, listeners labeled one of the target words as “bit” 97% of the time. That same target was labeled as “bet” 95% of the time after the carrier phrase with the lowered formants. This provided clear evidence that extrinsic information can affect listeners’ perceptual judgments of vowel sounds.
Unlike Joos, the authors did not explore a theory of articulatory gestures but maintained that listeners were creating a vowel space based on the first and second formants, like the one shown in Figure 2. Any ambiguous vowel sound’s formants would be plotted in this space and identified as the vowel that sits in that relative space. Many researchers have followed in the path set by Ladefoged and Broadbent (1957) and maintained that listeners are creating some kind of referent based on acoustics, although not always agreeing on what the referent actually is. For instance, Dechovitz (1977a) ran a study that used natural speech productions from an adult male and a nine-year-old male, instead of synthesized speech, and found an effect of talker on the perception of target stimuli. These results were used in support of a theory of vowel space mapping. Van Bergem, Pols, and Beinum (1988) ran a similar study with natural speech (Dutch recordings from a female child and a male adult) and also found an effect of talker. Unlike Dechovitz (1977a), though, they concluded that a vowel space theory did not explain their results as well as a theory that relied on templates. Therefore, the acoustic referents that they proposed were average vowel templates that the listener had for men, women, and children. Once the listener categorizes the speaker as one of the three based on pitch and timbre, they use that specific template to identify the speaker’s vowels. Verbrugge, Strange, Shankweiler, and Edman (1976) took a slightly different approach. Based on a hypothesis that Joos (1948) proposed, they tested whether exposure to the point vowels (/i/, /ɑ/, /u/) aided listeners’ perception of target sounds for a specific talker. The logic behind this was that the point vowels have the most extreme formants, and so would be the best candidates for setting up the vowel space.
They found that listeners made fewer vowel identification errors when the target was heard within a sentence, rather than after a series of point vowels. This led the authors to believe that listeners were extracting information about the rhythm of a particular talker’s speech and not necessarily creating a vowel space. Ainsworth (1972; 1974) also found evidence that the rhythm of a preceding carrier phrase could affect perception of a subsequent vowel sound. This was suggested to be a secondary effect, after frequency, since the durational effects were largest for center vowels that bordered two different vowel categories. Assmann, Nearey, and Hogan (1982) found that listeners’ vowel identification improved when they heard stimuli blocked by speaker, rather than in a mixed-speaker condition. This supports the more general notion that the more experience a listener has with a specific speaker, the better the listener is at understanding that speaker’s productions. Not all of these acoustic-referent based studies have been successful, though. Dechovitz (1977b) ran one that used productions from two male adults and, this time, did not find an effect of talker. Darwin, McKeown, and Kirby (1989) also ran a series of studies that used the same set-up as Ladefoged and Broadbent (1957) and were not always able to elicit an effect, either. Even when effects are found, researchers propose different cues to account for the normalization (as demonstrated above). And while it cannot be denied that prior exposure to a talker changes the perception of a subsequent target, the explanations behind this effect are not as clear-cut as the original vowel-space proposal from Ladefoged and Broadbent (1957). As mentioned previously, the acoustic-reference theories are one of two types of extrinsic theories.
The other type more closely resembles Joos' original proposal (1948) and assumes that listeners' reference is not based on the acoustics but on the vocal tract dimensions or the articulation patterns specific to the speaker. Upon exposure to a talker, listeners are able to extract information about that talker's specific vocal tract anatomy and physiology and articulatory patterns. Halle and Stevens (1962) proposed a two-stage model. In the first stage, the acoustics are converted into vocal tract parameters that would define each phoneme of a particular talker, thus removing any variability due to vocal tract anatomy and physiology. In the second stage, these parameters are transformed into the actual perceived phonemes. This transformation would take into account other potential talker differences, such as dialect, accent, speaking rate, and any other idiosyncratic speech habits. McGowan (1997) and McGowan and Cushing (1999) attempted to implement something similar mathematically, showing how listeners could use the acoustic signal to derive the vocal tract characteristics of a particular speaker. However, Nordström and Lindblom (1975) suggest that normalization occurs in a simpler fashion. First, a listener uses the third formant of open vowels to estimate the vocal tract length, since the third formant has been shown to be related to length (Lindblom & Sundberg, 1971). Then, the difference between the speaker's length and the known average for a speaker of the same age and gender is used to extract a scaling factor that will normalize for any differences between the particular talker and the "average" talker. Here, vocal tract length is used to adjust perception, rather than being the way (via articulatory patterns) that perception is actually achieved. A rather different study was that of Remez, Rubin, Nygaard and Howell (1987).
They replicated Ladefoged and Broadbent (1957) using sine-wave speech for both the carrier phrases and the targets, finding very similar results to the original and demonstrating that sine-wave speech can be perceived phonetically. They claimed that the changing formant patterns provided substantial information from which the listener could extract vocal tract dimensions. (See Ainsworth (1975) and Nearey (1989) for experiments that pit the intrinsic and extrinsic theories against one another.)

1.2.3 Other Approaches

While most talker normalization theories fall into one of the above categories, there are a few that do not, because they do not agree that there is a need for "normalization." Magnuson and Nusbaum (2007) argue against both intrinsic and extrinsic theories, claiming that both assume that once the differences between talkers have been accounted for, there is a direct correspondence between the source of information and phonetic identification. This suggests that the ability to perceive phonemes relies on a passive system capable of discarding the variance between talkers to get to the information that is necessary for vowel identification. Magnuson and Nusbaum (2007) assert that talker variance is not so easily disposed of and is actually used by listeners to modify their perception. The authors propose that talker normalization is done via an active system that can adjust how the incoming acoustic output is perceived based on knowledge and expectations of the speaker. For example, one of the experiments in Magnuson and Nusbaum (2007) manipulated listeners' expectations by informing them that the stimuli they were about to hear were either from one speaker or from two different speakers. Results showed that expectations did make a difference in the outcome of the experiment. Reaction times were longer when participants believed the stimuli were from two different talkers than when they believed they were from just one talker. Fenn et al.
(2011) found similar results when they tested listeners' ability to detect a change of voice in a telephone conversation. Unless the voice change was between genders, most listeners did not notice a change in talker without receiving prior information that there would be more than one talker. Glidden and Assmann (2004) ran a different study, though, on the influence of gender expectations, using visual stimuli matched with acoustic stimuli. They found that the perception of a target sound changes, depending on whether the listener is watching a female or male speaker produce the sound. These results show that listeners' perception can be influenced by their expectations (or top-down knowledge) and is not only influenced by the acoustic input. Goldinger's (1998) theory of talker normalization is based on an episodic account of listeners' ability to understand a wide range of talkers. Similar to Magnuson and Nusbaum (2007), this theory does not actually call for "normalization" but instead claims that listeners store each token of every word that they hear in memory. When a word is heard from a familiar speaker, many of the stored tokens for that word will be activated, and so the word will be identified. When a word is heard from an unfamiliar speaker, tokens for that word will still be activated but to a lesser degree. Because there will be fewer tokens that are similar to the word for a novel speaker, there will be a longer reaction time for identification. This theory, however, cannot predict or explain the finding that a preceding carrier phrase can shift a listener's perception of a subsequent target vowel. The units that are stored in memory are at the word level; therefore, the same word should have the same identification, as no prior information is taken into account. If this were the case, listeners' perception of the same target should not change.
For example, in the case of Ladefoged and Broadbent (1957), the ambiguous "bit/bet" target should always have been perceived consistently as either "bit" or "bet." Further, upon hearing the same ambiguous target over and over, listeners' memory of that word should have become stronger and stronger, thus solidifying its identification as the same percept. The exemplar model would have to change the size of the unit to be able to explain extrinsic effects. This could very quickly run into problems, though. If the size of the unit was extended to include several words, then would the listener need to have heard those words in that specific order to understand them? Due to the infinite number of ways words could be grouped together, this idea does not seem like a plausible explanation. Thus, it appears that the exemplar theory cannot fully explain how listeners understand many different talkers. As mentioned previously, though, even the predominant theories on extrinsic normalization fall short. The next section discusses a relatively novel approach based on general auditory interactions in perception.

1.3 Long-term Average Spectrum (LTAS) Model

As mentioned previously, a substantial change in talker has not always led to a change in perception of a target phoneme. Dechovitz (1977b) was not able to find an effect of talker when he used stimuli from two male adults, even though the speakers' vocal tracts differed greatly in size and shape. Darwin, McKeown, and Kirby (1989) also failed to demonstrate talker-based shifts when they presented listeners with filtered carrier phrases and asked them to identify a target at the end. Why is it, then, that listeners' judgment of vowels does not always change as a function of speaker? A possibility extends from the work of Holt (2005; 2006). These experiments investigated the effects of non-speech precursors on the identification of a target speech sound.
Specifically, listeners were presented with a sequence of tones followed by one target speech sound from a series that changed perceptually from /dɑ/ to /gɑ/. Listeners were asked to identify whether the target sound was a /dɑ/ or a /gɑ/. One of the main cues that listeners use to distinguish between these two sounds is the transition of the third formant. If the transition starts high, the speech sound is typically identified as /dɑ/. If the transition starts low, the perceived sound is /gɑ/. Holt (2005; 2006) found that listeners' boundary along the series between /dɑ/ and /gɑ/ shifted as a function of the mean frequency of the preceding tone sequence. If the tone sequence had a high average, above the F3 of the target, listeners heard /gɑ/ more often. If the tone sequence had a low average, below the F3, listeners heard /dɑ/ more often. In fact, this result was predicted from previous research on phonetic context effects (Lotto & Kluender, 1998; Holt, Lotto, & Kluender, 2000; Lotto & Holt, 2006), which showed that a preceding high-frequency sound (whether a formant, tone, or noise band) can cause the perceived formants of a following speech sound to be effectively lowered. In this case, a /dɑ/ would become a /gɑ/, since a /gɑ/ has a lower frequency F3 – just as Holt (2005; 2006) found. Even further, these effects can be predicted by calculating the spectral average across the tone sequence precursor and comparing it to the spectral average of the target (particularly in the region of an identifying cue like the F3 for /dɑ/ to /gɑ/). This suggests that listeners may be calculating some kind of average during the precursor and using this average when making perceptual judgments. This type of effect is known as a spectral contrast effect – a preceding stimulus affects the perception of the current stimulus in a contrastive manner – and different types of contrast effects have been found in other perceptual domains.
For example, when participants are asked to pick up a weight and judge how heavy it is, their perception changes, depending on whether they picked up a heavier or lighter weight beforehand. If they picked up a heavier weight, the target weight seems lighter and if they picked up a lighter weight, the target weight seems heavier (Di Lollo, 1964). Similar types of effects on color perception have also been found in the visual domain (Chevreul, 1839). It seems that contrast effects are a general property of our perceptual systems. In the auditory domain, this spectral contrast effect appears to cause changes in speech perception. Holt's (2005; 2006) demonstration of this with a preceding tone sequence followed by a target is provocatively similar to the paradigm of Ladefoged and Broadbent (1957). In both, listeners are presented with a preceding stimulus (tones or speech) followed by a speech sound that they are asked to identify. And, in both, listeners' perception of the target changed as a function of the content of the preceding stimulus. Since these tasks are so similar, it is tempting to suggest that the same mechanism can account for both of them. That is, listeners could be calculating the average energy at each frequency during the precursor (known as the long-term average spectrum or LTAS) and using this as a referent for target perception. If the LTAS of a particular carrier phrase has a peak in spectral energy close to the frequency of a target speech sound's formant (and this formant is important for the identification of that particular target), then the carrier would be predicted to have an effect on the target's perception. Alternatively, if the LTAS does not show a peak close to a formant, then it could be predicted that the carrier phrase would not have an effect on the target's perception. Additionally, the LTAS comparisons would not only show when a target's perception would be affected, but also how the target's perception would be affected.
Like in the studies of Holt (2005; 2006), the tone sequence with a high average would be expected to lower the perception of the third formant in the "da" to "ga" series and so would be predicted to result in more "ga" responses. By comparing the LTAS of a precursor and a target, one is capable of making accurate predictions of how a particular target sound will be perceived. This novel approach has the potential to explain the results of past extrinsic normalization studies that use this paradigm, in which a listener hears a preceding context (whether tones or speech) and is asked to identify a target speech sound at the end. LTAS comparisons between the carrier phrases and the target should reveal differences between the carrier phrases in the frequency of the peaks in spectral energy in a region close to one of the formants of the target. For example, the main manipulation used by Ladefoged and Broadbent (1957) was to lower or raise the first and second formant across the phrase "Please say what this word is." For one of the phrases, they raised both the first and second formant. For another of the carrier phrases, they lowered both of the formants. The target sound they used was an ambiguous speech sound identified as "bit" half of the time and as "bet" half of the time. When listeners heard this target after the carrier phrases, they heard "bit" more often following the carrier phrase that had raised formants. These results could be predicted from an LTAS account. An important cue for listeners when making a distinction between a "bit" and a "bet" is the first formant. If the first formant is low, listeners will identify the sound as "bit." If the first formant is high, listeners will identify the sound as "bet." The carrier phrase with the raised formants would have had more spectral energy in the higher frequencies.
This could have effectively lowered the perception of the first formant in the target and so would explain why the carrier phrase with the higher formants resulted in more "bit" responses. This model is, therefore, capable of predicting and explaining the effects that a carrier phrase can have on target perception. These comparisons may also be able to explain when shifts in target perception did not occur. It is possible that a preceding carrier phrase would not have energy close to a target's formants and so would not be predicted to shift listeners' perception. This suggests that the spectral content of the carrier phrase determines whether an effect is found, not the speaker of that phrase. This means that a change in talker will not necessarily result in a change in perception. Only those changes in talker that result in changes of the spectral average (around a formant) will cause a shift in target perception. This provides an alternative explanation for experiments that found no difference in target perception when the carrier phrase was produced by different talkers (Dechovitz, 1977b; Darwin, McKeown, & Kirby, 1989). For example, the carrier phrases used in Dechovitz (1977b) were produced by two male talkers who had "substantially different vocal tract dimensions" (p. S39). While the differences between the speakers are not explained further, it could be that the talkers were still similar enough (both being male speakers) that any theory would predict that the target would not be perceived differently. For instance, a theory based on vowel-space mapping would claim that their vowel spaces were very similar and so would predict similar perception of the target. From the perspective of the LTAS account, though, the lack of an effect would have been because there was not a substantial difference in the LTAS of the two carrier phrases that would cause a change in perception.
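The kind of LTAS comparison described above can be sketched computationally. This is only a minimal illustration of the idea, not the analysis used in any of the studies cited; the sampling rate, tone frequencies, band width, and the `ltas` and `energy_near` helpers are all hypothetical choices made for this sketch.

```python
import numpy as np

def ltas(signal, sr, frame_len=512, hop=256):
    """Long-term average spectrum: mean power spectrum across short frames."""
    frames = [signal[i:i + frame_len] * np.hanning(frame_len)
              for i in range(0, len(signal) - frame_len, hop)]
    spectra = [np.abs(np.fft.rfft(f)) ** 2 for f in frames]
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    return freqs, np.mean(spectra, axis=0)

def energy_near(freqs, spectrum, center_hz, bandwidth_hz=300):
    """Average energy in a band around a cue frequency (e.g., a target's F3)."""
    band = (freqs > center_hz - bandwidth_hz) & (freqs < center_hz + bandwidth_hz)
    return spectrum[band].mean()

# Hypothetical precursors: tone complexes with a high vs. low mean frequency.
sr = 16000
t = np.arange(0, 1.0, 1.0 / sr)
high_precursor = sum(np.sin(2 * np.pi * f * t) for f in (2800, 3000, 3200))
low_precursor = sum(np.sin(2 * np.pi * f * t) for f in (1800, 2000, 2200))

for name, sig in [("high", high_precursor), ("low", low_precursor)]:
    freqs, spec = ltas(sig, sr)
    above = energy_near(freqs, spec, 3000)  # energy above a nominal target F3
    below = energy_near(freqs, spec, 2000)  # energy below it
    # Contrast prediction: more precursor energy above the target's F3 region
    # should effectively lower the perceived F3, pushing responses toward /ga/.
    print(name, "precursor predicts", "/ga/" if above > below else "/da/")
```

The comparison is contrastive: whichever side of the cue frequency carries more precursor energy pushes the perceived formant away from it, which is the direction of shift the LTAS account predicts.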
Without the original stimuli, the ability of the LTAS model to explain past results cannot be tested. However, it does present an appealing alternative account for why talker normalization effects have not always been found and can be used to make testable predictions in future studies (as this research aims to demonstrate). These predictions can, in fact, tease apart the differences between the predominant theories and this new LTAS theory. Broadbent and Ladefoged (1960) actually proposed something similar to the LTAS account when they revisited their theory of talker normalization. They changed their explanation to one that was based on a theory of adaptation level (Helson, 1964). Essentially, if listeners hear a series of sounds or a phrase with a high first formant, then the "adaptation level" (or average F1 value for that speaker) would be set high. When the listener then hears a target word, its perceived F1 value would be judged based on this immediate exposure that the listener had with that talker, and the word would be identified as the word with the lower F1 (so, in the case of "bit" or "bet," it would be perceived as "bit"). This theory is closer to an LTAS account than their previous theory, which proposed the listener was making an acoustic-phonemic map for each talker. More recent findings of Watkins (1991) and Watkins and Makin (1994; 1996) have also pointed towards the need for a new model that could account for effects of extrinsic talker normalization. They have used spectrally distorted and filtered carrier phrases and shown that these affect vowel perception differently, depending on their spectral content. However, neither of these groups has developed a full theoretical model that can provide an explanation for effects of changes in talker. The research discussed in this dissertation proposes, and aims to develop, a novel, quantitative model that is capable of predicting effects of talker normalization.
CHAPTER 2

GENERAL AUDITORY MODEL

2.1 Limitations of Previous Stimuli

While previous research on talker normalization has unquestionably helped further the development and knowledge of the field, the stimuli used in many of the experiments limit the conclusions that can be drawn. One type of stimulus set used follows the work of L&B (Ladefoged and Broadbent (1957) will be referred to as L&B from this point forward): A stimulus or stimulus set is recorded from a speaker and then synthesized and manipulated, potentially in ways that are not possible with a human vocal tract. For example, six versions of the carrier phrase "Please say what this word is" from L&B's study were created to sound as though each one was produced by a different talker. This was done by manipulating one or both of the formants up or down in frequency from their original frequency value. The problem with stimuli of this type is that the manipulations imposed may not be realistic for what a speaker is physically capable of with his/her vocal tract, and so are unconstrained by what is anatomically and physiologically possible. To investigate talker normalization, it would be more beneficial to have stimuli that are accurate representations of different talkers' productions. This would help substantiate the interpretation of results from perceptual experiments, since the stimuli would more closely resemble the real-world input that a listener receives. This leads to the other stimulus type – actual recordings from multiple talkers that are not synthesized or manipulated but left in their raw form. While these are obviously realistic and "real-world," it is hard to control for the types of differences between talkers and to know what the differences actually are and how they affect the acoustic output. Even if dialect/accent and gender are controlled, there are still variations in the anatomical structure of the vocal tract and idiosyncratic production styles that remain unknown.
What is gained in terms of ecological validity is at the expense of control and knowledge of the important variables. It would require a great deal of imaging and articulatory measures to know the true anatomical differences that exist between a set of talkers. Further, even with this knowledge, it would be difficult to match other aspects of speech production like the speaking rate or timing of the articulations and the source or pitch variation over time. What is needed is a stimulus set that is realistically related to talker differences but, at the same time, provides controllable and knowable talker differences.

2.2 TubeTalker

One way to create stimuli that are both controllable and realistic, and thus avoid the problems of the previously mentioned stimulus sets, would be through the use of a computational model that could control both the source and filter of a modeled vocal tract. Such a model would need to have two specific characteristics. The first would be that the model be based on analyses of real vocal tracts. That is, the model would have to function and be constrained in the same manner as real vocal tracts are during production. The second characteristic would be that the model allows the user to have complete control over all aspects of the vocal tract (source, size, shape, length, articulatory pattern, etc.). With such a model, different talkers could be created in a realistic and controllable way. Any changes in the formants or acoustic output would be based on actual modifications that vocal tracts can undergo and not post-hoc adjustments to an already recorded stimulus. Further, the cause for any acoustic variation between talkers would be known, since the user would know which parameters were set differently. For example, if two talkers were created that had all parameters set equal except for their length, then length would be the reason for any differences in the acoustic output.
Such a computational model exists and is known as the TubeTalker (Story, 2005a; Story & Bunton, 2010).

2.2.1 TubeTalker Description

In its simplest form, speech production can be thought of as two separate components: the source and the filter (Fant, 1960). The voicing source provides the energy or activation of the system via an interaction of the respiratory system and vocal folds. The filter is the shape of the vocal tract, or, rather, the airspace created by the articulators above the glottis, essentially a shaped "tube" through which the source passes. Each possible shape of the filter or tube results in a set of resonances or formant peaks. These different sets of resonant peaks help define different speech sounds. During speech, the source and filter work together in a continuously changing manner. The source is either creating voiced or voiceless sounds (the vocal folds are either in vibration or not). The vocal tract is moving fluidly through various airspace shapes (the articulators are in constant motion). The coordinated control of both produces a string of speech sounds that a listener is able to decode as a meaningful message. The source-filter theory is one of the underlying principles behind the TubeTalker synthesizer. The source of the TubeTalker is based on a kinematic model of the vocal folds, which allows manipulation of a number of parameters: amount of adduction at the arytenoid processes, shape of the glottal opening at rest, phase and amplitude of the movement of the upper and lower portions of the vocal folds, thickness of the vocal folds, length of the glottal opening, and fundamental frequency (Titze, 1984). The effects of the filter are computed based on a wave-reflection model that takes into account how the acoustic waves interact with physiological aspects of the vocal tract, such as the skin and side cavities (e.g., the piriform sinuses) (Story, 1995).
The filter can be thought of as a tube system made up of 44 mini-tubes or "tubelets." Each tubelet's length and cross-sectional area can be specified individually so that the model can take into account articulatory movements like lip protrusion or a lowering of the larynx. The length of each tubelet can also be set to create more general differences. For example, keeping the cross-sectional area constant, while changing the length of each tubelet to be longer or shorter, can create different "vocal tracts." This essentially changes the length of the tube and would cause a change in perceived talker, even if it is constricted and expanded in the exact same way as another tube. On the other hand, the cross-sectional area could be manipulated while the length is maintained to create vocal tract shapes that differ in cavity size of the oral or pharyngeal regions. This vocal tract synthesizer gets around the problems that were mentioned before by allowing for the creation of multiple talkers with known vocal tract features and articulatory patterns. An underlying tube shape can be made by setting the starting cross-sectional area and length of the tubelets and then can be constricted and expanded to create different speech sounds. All of these parameters can be controlled and the effect that they have on perception can be examined.

2.2.2 Development of the TubeTalker

As its name implies and as explained above, the TubeTalker can be thought of as a tube system. The tube can be constricted or expanded to match the cross-sectional area function of a vocal tract (or, rather, the airspace within the vocal tract) that produces that same sound. The cross-sectional area functions that the model uses were based on MRI images of sustained phoneme productions from a male vocal tract (Story, Titze, & Hoffman, 1996). Analyses from these images allowed for the development of a hierarchical system that incorporates three tiers to produce continuous, time-varying speech (Story, 2005a).
Basically, the model starts with a "neutral" vocal tract shape (Tier 1). This shape is then modified to create particular vowel sounds (Tier 2) and then consonant constrictions are added to the underlying vowel-to-vowel transitions at specific points in time (Tier 3) to create synthesized speech. The vowel shaping in the model is controlled by two parameters that are the first two principal components derived from the area functions for ten of the vowels from Story, Titze, & Hoffman (1996). These analyses showed that the mean shape of the area functions for the vocal tract approximated a uniform tube (the neutral vocal tract shape). This neutral shape or starting point is Tier 1. Story and Titze (1998) suggest that the acoustic output of this shape is the neutral vowel sound and that this neutral shape is "perturbed" during speech production to create other speech sounds. To create vowels, this shape is perturbed by two "modes." The modes can be thought of as two different tube shapes whose values can be set independently of one another to create a combined shape with unique resonant properties. The first mode accounted for 75% of the variance in the area functions and, when its value is increased, it raises the frequency of F1 but lowers the frequency of F2. The second mode accounted for 18% of the variance and, when its value is increased, it raises the frequency of both F1 and F2. From an articulatory standpoint, the first mode could be thought of as controlling the upward/downward and backward/forward movement of the tongue and some of the upward/downward movement of the jaw. The second mode has the most influence on the middle of the vocal tract (tongue arching and upward/downward motion), as well as controlling lip rounding. Using the modes to change the shape of the neutral vocal tract and create vowels is Tier 2.
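The two-mode perturbation of Tier 2 can be illustrated with a simplified additive sketch. The real modes are derived empirically from measured area functions, and the published model's formulation differs in detail, so the sinusoidal mode shapes, the uniform neutral area, and the coefficient values below are stand-ins invented for illustration only.

```python
import numpy as np

N = 44  # tubelets along the tract, glottis (index 0) to lips (index N-1)
x = np.linspace(0.0, 1.0, N)

# Hypothetical stand-ins for the empirically derived modes: in the model,
# the modes come from a principal components analysis of area functions.
neutral = np.full(N, 3.0)      # uniform neutral tract area (cm^2), illustrative
mode1 = np.sin(np.pi * x)      # broad front/back tradeoff shape (illustrative)
mode2 = np.sin(2 * np.pi * x)  # mid-tract arching shape (illustrative)

def vowel_area(q1, q2):
    """Tier 2 sketch: perturb the neutral shape with two mode coefficients.
    A simple additive version; the published model is more elaborate."""
    area = neutral + q1 * mode1 + q2 * mode2
    return np.clip(area, 0.1, None)  # physical areas must stay positive

# Different (q1, q2) pairs yield different vowel-like area functions, while
# applying the same pairs to a different neutral shape would yield the
# "same vowel" spoken by a different talker.
a_vowel_1 = vowel_area(q1=1.5, q2=-0.5)
a_vowel_2 = vowel_area(q1=-1.0, q2=0.8)
```

The key design idea this captures is that a vowel is specified by two coefficients, while the talker is specified by the underlying neutral shape the coefficients perturb.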
The third tier adds the consonant constrictions to the underlying vowel component of Tier 2 (also based on analyses from Story, Titze, & Hoffman (1996)). These can be defined in a number of ways (e.g., location and area) that can be constrained by how and where the articulators constrict the air space of the vocal tract for real speech productions. Once these are configured, the voice source described above is added to the system to create synthesized speech.

2.2.3 Creating Different Talkers

The parameters for the neutral vocal tract shape can be modified by adjusting the length and cross-sectional area of the tubelets mentioned previously. The standard vocal tract created by default settings of the TubeTalker model is that of a male with a vocal tract length of 17.5 cm. To create different speakers, one can adjust the length to be longer or shorter, while keeping the cross-sectional area constant. On the other hand, the length can remain constant and the cross-sectional area can be changed. The oral cavity could be expanded while the pharyngeal cavity is constricted or vice versa. Figure 3 shows the neutral vocal tract shape for TubeTalkers with these modifications.

Figure 3. A pseudo-midsagittal view of the neutral vocal tract shapes for the five "talkers" created from the TubeTalker. (Note that the center VT in both the bottom and top row is the StandardVT. This was done to make it easier to compare the StandardVT to each variation.) For each vocal tract, the cross-sectional area of each tubelet is plotted so that it is perpendicular to and equidistant above and below the centerline of the vocal tract. The x-axis is the anterior-posterior (A-P) dimension and the y-axis is the inferior-superior (I-S) dimension. (Image created by Dr. Brad Story.)

Once these shapes are made, parameters can be set for the vowel modes and for consonant constrictions that will perturb the neutral vocal tract shape.
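The length-based talker manipulation just described can be sketched with a simple tubelet representation. The 44-tubelet count and the 17.5 cm standard length come from the text; everything else here (the uniform 3 cm² area, the helper names, and the quarter-wavelength resonance check) is an illustrative assumption, not the TubeTalker implementation.

```python
import numpy as np

N_TUBELETS = 44  # from the model description

def make_neutral_tract(total_length_cm=17.5, area_cm2=3.0):
    """Hypothetical neutral tract: a uniform tube split into 44 tubelets."""
    lengths = np.full(N_TUBELETS, total_length_cm / N_TUBELETS)
    areas = np.full(N_TUBELETS, area_cm2)
    return lengths, areas

def scale_length(lengths, areas, factor):
    """Create a new 'talker' by uniformly lengthening or shortening every
    tubelet while keeping the cross-sectional areas constant."""
    return lengths * factor, areas.copy()

std_len, std_area = make_neutral_tract()                    # standard 17.5 cm tract
long_len, long_area = scale_length(std_len, std_area, 1.1)  # ~19.25 cm "talker"

# Sanity check on the acoustic consequence: for an idealized uniform
# closed-open tube, resonances fall near odd multiples of c / (4 * L),
# so the longer tract has uniformly lower resonances.
c = 35000.0  # speed of sound in warm moist air, cm/s (approximate)
f1_standard = c / (4 * std_len.sum())
f1_long = c / (4 * long_len.sum())
print(f1_long < f1_standard)  # True: longer tract, lower resonances
```

This mirrors the point in the text: the same constriction pattern applied to a longer tube yields a different perceived talker, with the source of the acoustic difference fully known.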
In this way, the model allows for complete control over synthesized talkers. The TubeTalkers differ in neutral vocal tract shape but all other parameters can be held constant – rate of production, vibratory source, pitch contour, vowel modes, and consonant constrictions. The exact same modifications are happening, just to different underlying shapes. Thus, the synthesizer solves the issues with the previous stimuli, because (1) the acoustic output will be changed in realistic ways, since TubeTalker is based on real vocal tracts and constrained by their physiological limits; and (2) the articulatory patterns and anatomical structure of the different "talkers" are controlled, instead of containing unknown variability. The majority of the stimuli used in this dissertation were created using the TubeTalker model.

2.3 General Auditory Model

A commonality across all of the theories of extrinsic talker normalization is that the listener is assumed to compensate for differences between talkers by extracting information specific to the speaker. Whether this was an acoustic-based reference, such as a speaker's vowel space, or the actual articulatory patterns for a speaker, the theories assumed that the listener was using some type of speaker-specific knowledge. If this were true, a substantial change in talker should have led to a change in vowel perception; however, this was not always the case. Dechovitz (1977b) and Darwin, McKeown, and Kirby (1989) were not always able to find an effect as a function of talker. This is a challenge for these predominant theories of extrinsic talker normalization. Chapter 1 proposed an alternative explanation that stemmed from the work of Holt (2005; 2006). This explanation is central to the model proposed and explored in this dissertation. Holt (2005; 2006) found evidence of a general auditory mechanism that calculates the average of the spectral content of a non-speech stimulus and uses this average when perceiving a following speech sound.
It is being proposed here, and has been proposed previously (Lotto & Sullivan, 2007), that listeners calculate the average of the spectral energy of a preceding speech phrase, as well. This average (referred to as the long-term average spectrum or LTAS) is then used as a referent when identifying speech sounds. By comparing the LTAS of preceding stimuli (speech or non-speech) with the target speech sound, one is able to make predictions about which phrases will shift the perception of the target sound. If a carrier phrase's LTAS has a peak in frequency that is close to the cue important for the target's identification (such as a formant), then the LTAS could affect the perception of that phoneme. For instance, if the LTAS peak is lower in frequency than a subsequent target vowel's F1, it could effectively raise the perception of that target's F1 and change the listener's perception of it – a spectral contrast effect. This means that LTAS comparisons not only predict when there will be an effect on perception, but also how perception is affected (i.e., in what direction the target's perception will be shifted). This was the first idea on which the LTAS model was based. The second idea that formed the basis of this model was a finding of Story (2005b; 2007b). He expanded his original research on vocal tract area functions to six other speakers, showing that the modes generalize across them. However, while the modes are very similar, the neutral vocal tract shapes are different – "the modes appeared to provide roughly a common system for perturbing a unique underlying neutral vocal tract shape" (Story, 2005b, p. 3855). Thus, the average vocal tract shape appears to be what separates or distinguishes between different speakers. Lotto and Sullivan (2007) suggested that if a listener can gain access to the acoustic output of the neutral vocal tract shape (the neutral vowel), they may be able to use this to normalize for differences across talkers.
The model that I have developed and test in this document assumes that listeners can, in fact, use the neutral vowel to normalize for talkers. Specifically, it suggests that listeners estimate the neutral vowel via the general auditory mechanism that calculates the LTAS. If the average of the different vocal tract shapes is the neutral vocal tract shape, it may be that the average of the acoustic output of different vocal tract shapes (speech) resembles the acoustic output of the neutral shape (the neutral vowel). Thus, the spectral average that listeners calculate over the duration of the carrier phrase could be similar to the spectral properties of the neutral vowel. Listeners could then be using this average as the referent by which they base their phonemic judgments. Figure 4 shows the vowel space of Figure 2, but this time the neutral vowel for each vocal tract is set as the origin. The variability between talkers is reduced substantially when the neutral vowel is set as the referent. If listeners do calculate an LTAS of the carrier phrase and this average resembles the neutral vowel of that talker, then listeners’ perception of a target speech sound would only change when the LTAS of the talker’s neutral vowel showed a contrastive effect with the formant of the target. It is necessary to point out, however, that these two ideas are independent of one another. It may be the case that listeners calculate the spectral average of a preceding carrier phrase and use it as a referent, but that this average does not reflect the acoustic output of the neutral vowel for a specific talker. The opposite could also be true.

Figure 4. Plot of the same vowels from Figure 2 with the addition of the neutral vowel. The graph has been adjusted so that the neutral vowel is the referent vowel or origin for each speaker.
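The variance reduction shown in Figure 4 can be illustrated numerically. In the sketch below, all formant values are fabricated for illustration; talker B’s vocal tract is treated as a uniformly scaled copy of talker A’s, so on a log-frequency scale the neutral-vowel-relative representations of the two talkers’ vowels coincide exactly:

```python
import numpy as np

# hypothetical F1/F2 values (Hz); talker B is a uniformly scaled copy of A
vowels_A  = {"i": (300, 2300), "a": (750, 1100), "u": (320, 900)}
neutral_A = (500, 1500)
scale = 1.2
vowels_B  = {v: (f1 * scale, f2 * scale) for v, (f1, f2) in vowels_A.items()}
neutral_B = (neutral_A[0] * scale, neutral_A[1] * scale)

def relative(vowels, neutral):
    """Re-express log-frequency formants relative to the neutral vowel."""
    return {v: (np.log(f1) - np.log(neutral[0]),
                np.log(f2) - np.log(neutral[1]))
            for v, (f1, f2) in vowels.items()}

rel_A, rel_B = relative(vowels_A, neutral_A), relative(vowels_B, neutral_B)
# the between-talker differences vanish in the neutral-relative representation
for v in vowels_A:
    print(v, np.allclose(rel_A[v], rel_B[v]))
```

The absolute formants differ by 20% across the two toy talkers, yet every vowel lands on the same point once the neutral vowel is the origin – the same collapse of talker variability that Figure 4 depicts. Real talkers are of course not exact scalings of one another, so the reduction in the dissertation’s data is substantial rather than total.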
Listeners could be extracting the acoustics of the neutral vowel and using this to normalize for different talkers, but they may not be using the LTAS as defined here (see Section 3.2.4.1 in Ch. 3) to do so. The general auditory model proposed here is not strictly tied to either of these ideas but rather to the development of a model that can explain the adaptive perception of speech. If part of the model is found to be unsupported, the model will be revised to incorporate those findings. This being said, the model at this point does include both of these ideas (the LTAS mechanism and the neutral vowel) and, based on these, an alternative model of talker normalization is proposed that has four main hypotheses: 1) during carrier phrases, listeners compute an LTAS for the talker; 2) this LTAS resembles the spectrum of the neutral vowel; 3) listeners represent subsequent targets relative to this LTAS referent; 4) such a representation reduces talker-specific acoustic variability.

CHAPTER 3
GENERAL METHODOLOGY AND PRELIMINARY STUDIES

3.1 Predictions

Specific predictions can be made based on the four hypotheses stated at the end of Chapter 2. These predictions are as follows:

Prediction #1: Shifts in perception of a target phoneme can be elicited when the LTAS of preceding contexts predicts opposite contrast effects on the target. Similar to the effects found by L&B, a change in talker can lead to a difference in target perception. According to the LTAS model and its hypotheses, these effects are general in nature and so are not limited to target vowel identification but should extend to consonant identification as well. Changes in “talkers” created from the TubeTalker should also be capable of producing these shifts if the change in LTAS for the talker results in differential contrast with the spectrum of the target.
Prediction #2: ONLY those talker differences resulting in a change in LTAS that corresponds to a differential contrast of an important phonetic cue will result in a shift in perception. If these effects are not due to talker-specific normalization processes but are strictly based on contrastive effects between the LTAS of the carrier phrase and the target, then it is possible that a substantial change in talker (whether the talker is real or created with the TubeTalker synthesizer) may not affect subsequent target perception. The LTAS for two (or more) talkers could fail to lead to differential contrast with the target spectrum, resulting in no perceptual effects.

Prediction #3: Backward stimuli should cause the same effect as the stimuli played forward. If the calculation of the LTAS does serve as the referent and targets are identified relative to it, then stimuli that have the exact same LTAS should cause the exact same effect on target perception. Therefore, playing stimuli in reverse should have the same effect as when the stimuli are played forward, since the LTAS of both is the same.

Prediction #4: The LTAS that is calculated is an extraction of the neutral vowel; therefore, a talker’s neutral vowel should have the same effect on perception as a phrase produced by that talker. The neutral vowel vocal tract shape is what varies between talkers, while the articulatory patterns are quite similar (Story, 2005b; 2007b). The LTAS model proposes that “normalization” occurs because the LTAS of a speaker will come to resemble the output of the neutral vocal tract. Thus, if listeners use the LTAS as a referent, individual differences in talker vowel spaces will have a reduced effect on listener identification. This implies that the LTAS of carrier phrases should have an effect similar to that of the neutral vowel.

Each of these predictions was specifically tested in a preliminary study to further develop the model.
3.2 General Methodology

3.2.1 Participants

All participants were recruited from the University of Arizona and received either course credit or monetary compensation for their time. All reported no history of hearing loss and reported that English was their native language.

3.2.2 Stimuli

Carrier Phrases. TubeTalker was used to create five different “talkers” by manipulating the underlying or neutral vocal tract (Figure 3). The first vocal tract was the standard vocal tract (StandardVT). Its length was 17.5 cm (tubelet = .39682 cm). The other four vocal tracts were all modifications of the StandardVT. The short vocal tract (ShortVT) was 20% shorter than the StandardVT (14 cm; tubelet = .317456 cm) and the long vocal tract (LongVT) was 20% longer than the StandardVT (21 cm; tubelet = .476184 cm). The synthesized vowels from Figure 1 were created with the StandardVT, ShortVT, and LongVT. The next two “talkers” differed in the neutral shape but were constant in length. These were created by manipulating the cross-sectional area of the oral and pharyngeal cavities relative to the StandardVT. One had an expanded oral cavity and a constricted pharyngeal cavity (O+P- VT) and the other had a constricted oral cavity and an expanded pharyngeal cavity (O-P+ VT). Some combination of these talkers was used in the experiments that follow. It is important to note that creating carrier phrases with the TubeTalker is a complex process. Because of this, phrase choices were limited to what the author (Story, 2005a) had created at the time of the development of these experiments.

Figure 5. Spectrograms of the endpoints of the /dɑ/-/gɑ/ target series. The onset frequency of the third formant is labeled on the y-axis.

Targets. The targets were always a series in which one of the segments gradually changed phonetic identity. Three different series were used in the experiments. The first was the 10-step /dɑ/-/gɑ/ series used by Holt (2005; 2006). Each target stimulus was 330 msec long.
They were based on natural tokens of /dɑ/ and /gɑ/ produced by a male, native-English speaker and were created using LPC analysis/resynthesis. The onset of the second and third formant was manipulated in approximately equal steps from an unambiguous /dɑ/ to an unambiguous /gɑ/. Spectrograms of the endpoints can be seen in Figure 5. The second and third target series were created from one continuous series synthesized from the StandardVT. This target series consisted of 20 steps that changed from “bit” to “bet” to “bat.” These targets did not change in equal formant intervals (as is typical when creating a target series) but changed in equal area or articulation steps (using the two modes described previously). Each member of the series was 250 msec in length and had a pitch contour that varied from 93 Hz at the /b/ to 100 Hz during the vowel and then dropped to 87 Hz before the /t/. The first twelve steps were used as a “bit” to “bet” series (bVt1-bVt12). The first two formants for the vowels in this series are plotted in a vowel space in Figure 6. This series was used in the majority of the main experiments; however, in Chapter 5 (Experiment 6) an eight-step “bet” to “bat” series was used. These consisted of the 10th-17th steps of the series (bVt10-bVt17).

Figure 6. F1xF2 vowel space of the vowel portion (at .082 sec) of the 12-step “bit” to “bet” target series.

3.2.3 Procedure

All of the preliminary studies (and the perceptual experiments described in Chapters 5, 6, and 7) follow a paradigm similar to that of Ladefoged and Broadbent (1957). The stimuli consist of a carrier phrase followed by a target syllable or word after a 50-msec inter-stimulus interval. Listeners are asked to identify the target as one of two options. For a summary of the carrier phrase and target combinations used in each of the preliminary experiments, see Table 1. Participants were run in groups of one to four on separate computers in a large soundproof booth.
On each trial, they were presented with a single randomly-selected Carrier Phrase + Target stimulus over circumaural headphones (Sennheiser HD 280) at approximately 75 dB SPL. They categorized the target by using a mouse to click one of two labeled boxes on a monitor (Figure 7 shows an example of the screen that participants saw during experimentation). The next trial did not begin until the participant responded. The stimuli were split into blocks with a certain number of stimulus repetitions in each block. This gave participants the opportunity to take short breaks between blocks if needed. The entire session, including consent and debriefing, took no more than one hour. Stimulus presentation and data collection were controlled by the ALVIN software program (Hillenbrand & Gayvert, 2005).

Table 1. A summary of the stimuli used in each of the preliminary studies (Pred. = Prediction; Exp. = Experiment).

Pred. | Exp. | Carrier Phrases | Duration | FW/BW | Target Series
1 | 1 | (1) ShortVT – “Abra…” (2) LongVT – “Abra…” | (1) 1.4 sec (2) 1.4 sec | FW | /dɑ/-/gɑ/
2 | 2a | (1) ShortVT – “He had a…” (2) LongVT – “He had a…” | (1) 1.3 sec (2) 1.3 sec | FW | /bɪt/-/bɛt/
2 | 2b | (1) O+P-VT – “He had a…” (2) O-P+VT – “He had a…” | (1) 1.3 sec (2) 1.3 sec | FW | /bɪt/-/bɛt/
3 | 3a | (1) ShortVT – “He had a…” (2) LongVT – “He had a…” | (1) 1.3 sec (2) 1.3 sec | FW | /bɪt/-/bɛt/
3 | 3b | (1) ShortVT – “He had a…” (2) LongVT – “He had a…” | (1) 1.3 sec (2) 1.3 sec | BW | /bɪt/-/bɛt/
4 | 4a | (1) ShortVT – Neutral Vowel (2) ShortVT – “Please say…” | (1) 400 msec (2) 1.7 sec | FW | /dɑ/-/gɑ/
4 | 4b | (1) ShortVT – Neutral Vowel (2) ShortVT – “Please say…” | (1) 400 msec (2) 1.7 sec | BW | /dɑ/-/gɑ/

Figure 7. An example of the computer screen participants see during experimentation. The software used to run the experiments is ALVIN (Hillenbrand & Gayvert, 2005).

3.2.4 Predictions

Predictions for each experiment were derived from comparisons of the average spectral energy for each carrier phrase with the spectrum of the ambiguous member of the target series (bit/bet = bVt5, bet/bat = bVt13, da/ga = dg5). Appendix A explains how the ambiguous targets were determined. Each of the predictions was based on a specific frequency region, depending on the target series being used. For the “bit” to “bet” and “bet” to “bat” series, the LTAS comparisons show the frequency region around the first formant (0-1200 Hz). This is because the first formant is the important cue for distinguishing between these three vowels (/ɪ, ɛ, æ/). An /ɪ/ has a lower first formant than an /ɛ/, which has a lower first formant than an /æ/. The frequency region used to make predictions concerning “da/ga” perception is from 1800 to 3000 Hz. This region contains the third formant, the important cue for the /dɑ/-/gɑ/ distinction. A good “da” has a higher F3 onset frequency than a good “ga,” which has a low F3 onset frequency. (See Appendix B for a continued discussion of why and how these frequency regions were focused on in making LTAS predictions.) Generally, if the LTAS of the carrier phrase had a trough (low amplitude) in the region of the ambiguous target’s formant and a frequency peak lower or higher than that formant, that carrier phrase was expected to have an effect on target perception. Figure 8 shows an example of an LTAS comparison with both of these characteristics. The carriers are the phrase “He had a rabbit and a” synthesized from both the Long and Short VT. The ambiguous target is the fifth step in the “bit” to “bet” series produced by the StandardVT.
Note that the ShortVT phrase has an average peak in the F1 region that is higher in frequency than the target’s, and the LongVT phrase has a peak that largely overlaps with the target. Based on this figure, the LTAS model would predict that the ShortVT phrase would cause an effective lowering of the F1 in the target, making it more “bit”-like. Thus, a test of the prediction would be whether there are more “bit” responses for the ShortVT versus the LongVT contexts. It is important to note that while the spectral comparisons are referred to as “LTAS” comparisons, the images actually use a combination of LTAS, linear-predictive coding (LPC), and discrete Fourier transform (DFT) spectral representations. The choices are based on the length and spectral content of the carrier phrase. As a general rule, LPC is used for stimuli less than one second in length. The figure caption for each LTAS comparison will specify which type of representation is being used for each stimulus.

Figure 8. LTAS comparison between the carrier phrases “He had a rabbit and a” of the LongVT and ShortVT and the ambiguous target from the “bit” to “bet” series (bVt5) in the region surrounding F1. (LTAS = LongVT, ShortVT; LPC = bVt5).

3.2.4.1 Deriving the LTAS

Signal Analysis. The LTAS provides an analysis of the spectral properties of an acoustic signal over a longer temporal duration than is typical with a spectrum. The LTAS separates the signal into a certain number of bins or windows of time. For each time frame, a spectrum is calculated using a DFT. The average amplitude for each frequency is then calculated across these time frames. The LTAS can then be plotted in terms of frequency by amplitude. In the past, the LTAS has been used to determine the distribution of frequency energy present for hearing or other acoustic detection systems.
In many of these cases, it has been used to describe the running speech of males, females, and children in an attempt to advance hearing aid development (i.e., to determine which frequency regions contain more energy and may be more important for speech perception) (Dunn & White, 1940; Byrne, 1977; Cornelisse, Gagné, & Seewald, 1991; Byrne et al., 1994; Pittman, Stelmachowicz, Lewis, & Hoover, 2003). More recently, it has been used to describe voice quality differences: male vs. female (Mendoza, Valencia, Muñoz, & Trujillo, 1996), young vs. old (Linville & Rens, 2001; Linville, 2002), and singing vs. speaking (Cleveland, Sundberg, & Stone, 2001). These studies use the LTAS in a descriptive fashion to provide a measure of the frequency composition of input. The theoretical account proposed in this dissertation suggests that listeners actually calculate something akin to the LTAS during speech comprehension. This average is then used to make subsequent perceptual judgments. This proposal is novel and utilizes the acoustic measure of the LTAS as a perceptual representation, in much the same way that the FFT spectrum went from being an acoustic analysis to the presumed representation of speech perception.

Within this proposal, the LTAS comparisons were computed and graphed in the MATLAB programming environment (The MathWorks, Natick, MA, USA) with in-house scripts written by Dr. Brad Story. As mentioned, a combination of analyses was used, depending on which best represented each specific stimulus. The LTAS was computed by taking the average of non-overlapping spectra extending over the entire duration of the signal. Each individual spectrum in the LTAS corresponded to a 100-msec Hanning-windowed section of the waveform, zero-padded to 8192 points prior to computing an FFT. The LPC spectra were calculated for a 25-msec Hamming-windowed section of the waveform using an LPC analysis with 46 coefficients. The sampling frequency of all analyzed waveforms was 44100 Hz.
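A minimal sketch of the LTAS computation just described (100-msec non-overlapping Hanning windows, zero-padding to 8192 points, magnitude spectra averaged across frames, 44100-Hz waveforms) might look like the following. The original analyses used in-house MATLAB scripts, so this Python/NumPy version is a reconstruction under the stated parameters, not the actual code:

```python
import numpy as np

def compute_ltas(signal, fs=44100, win_ms=100, nfft=8192):
    """Average the magnitude spectra of non-overlapping Hanning-windowed
    frames, each zero-padded to nfft points before the FFT."""
    win_len = int(fs * win_ms / 1000)      # 100 ms -> 4410 samples at 44.1 kHz
    n_frames = len(signal) // win_len      # any partial final frame is dropped
    window = np.hanning(win_len)
    spectra = [np.abs(np.fft.rfft(signal[i * win_len:(i + 1) * win_len] * window,
                                  n=nfft))  # zero-padded to nfft points
               for i in range(n_frames)]
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    return freqs, np.mean(spectra, axis=0)

# quick sanity check on a synthetic 1-s, 500-Hz tone:
# the LTAS should peak near 500 Hz
t = np.arange(44100) / 44100.0
freqs, ltas = compute_ltas(np.sin(2 * np.pi * 500 * t))
print(freqs[np.argmax(ltas)])
```

The 8192-point zero-padding gives a frequency grid of roughly 5.4 Hz per bin, so the peak of the tone’s LTAS lands within a few Hz of 500. The LPC branch of the analysis (46 coefficients over a 25-msec Hamming window) is omitted here for brevity.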
Neural Representation of LTAS. While the LTAS has been used to analyze acoustic signals, and listeners’ ability to calculate it is an assumption of the model proposed, there are no findings demonstrating a neural representation of the LTAS as of yet. Sjerps, Mitterer, and McQueen (2011a) ran an EEG study investigating neural responses to spectral contrast effects with speech stimuli. They found a response in the N1 time window, leading them to suggest that these effects occur at a very early stage of processing (within the first 100–200 msec) that is not influenced by conscious processing of the listener. However, the authors do not give any suggestion of where or how this is actually being represented and processed in the brain.

3.2.5 Analysis Methods

Depending on the experiment, separate paired-sample t-tests or ANOVAs were conducted to show whether shifts in target perception were significant. The dependent variable for the perceptual studies was the percentage of “da,” “bit,” or “bet” responses collapsed across all series members (for the preliminary studies) and across the five middle targets (for the main experiments). As the middle of the series contains the more ambiguous targets, it was determined that collapsing across these is more sensitive to shifts in target perception than collapsing across the entire series. The five middle targets for the bit/bet series were bVt4-bVt8 and for the bet/bat series were bVt11-bVt15. The carrier phrase served as the independent variable.

3.3 Prediction #1: Shifts in perception of a target phoneme can be elicited when the LTAS of preceding contexts predicts opposite contrast effects on the target.

3.3.1 Preliminary Experiment 1

The earlier work on spectral contrast by Lotto and Kluender (1998) and Holt (2005; 2006) used target series that varied from /dɑ/ to /gɑ/ (as opposed to the vowel targets in L&B).
In order to link these two empirical paradigms within one model, one of the target series used was the /dɑ/-/gɑ/ targets (from the work of Lotto and Holt), following carrier phrases produced by different talkers (as in L&B) created with TubeTalker. Thus, the first aim of this experiment was to test whether the effects L&B found with vowel normalization could be extended to consonant identification, as would be predicted by the LTAS model. The second aim of the experiment was to validate that the different “talkers” created with TubeTalker could indeed produce the same types of shifts in target categorization as L&B demonstrated.

3.3.2 Stimuli and Predictions

LTAS comparisons showed that the phrase “Abracadabra” synthesized from the LongVT and the ShortVT would have a contrastive effect on the /dɑ/-/gɑ/ target series. Examining the region around the third formant showed that the LongVT had a peak lower in frequency than the ambiguous target’s F3 and a trough in the region of the ambiguous target’s F3. This could push the effective F3 of the target higher in frequency, causing more “da” responses. The ShortVT, though, does not have a peak and so would not be expected to have an effect. This comparison is shown in Figure 9. The figure only shows the F3 region, as the third formant is the important cue for the /dɑ/-/gɑ/ distinction.

3.3.3 Participants and Procedure

Seventeen subjects participated in this study. There were 200 trials total (1 carrier phrase x 2 talkers x 10 targets x 10 repetitions).

3.3.4 Results and Discussion

A significant shift in “da” responses in the direction predicted by the LTAS comparison (t(16) = 5.07, p < .005) was found between the ShortVT and LongVT (see Figure 10). The results provide a demonstration of a shift in consonant target perception as a function of talker in the carrier phrase. This links the paradigms used by Ladefoged & Broadbent on talker normalization with the work by Lotto and Holt on spectral context effects.
The direction of the shift obtained is consistent with the LTAS comparison. Further, these results provide evidence that these controlled stimuli (with equal settings for all parameters but length) are capable of eliciting shifts as a function of “talker.”

Figure 9. LTAS comparison between the carrier phrases “Abracadabra” of the LongVT and ShortVT and the ambiguous target from the “da” to “ga” series (dg5) in the region surrounding F3 (LTAS = LongVT, ShortVT; LPC = dg5).

Figure 10. Results for Preliminary Experiment 1: Mean percent “da” responses for the LongVT and ShortVT synthesized versions of “Abracadabra.” Error bars indicate standard error of the mean.

3.4 Prediction #2: Only those talker differences resulting in a change in LTAS that corresponds to a differential contrast of an important phonetic cue will result in a shift in perception.

3.4.1 Preliminary Experiment 2

An important assumption of Prediction #2 is that not all shifts in talker identity (even if perceptible) should lead to shifts in target responses. Whether a change in talker will or will not result in a change in target perception should be predictable from a comparison of the carrier’s LTAS and the target. That is, different talkers may not have different effects on target perception if their LTAS differences do not result in differential contrast effects when compared to the target’s spectrum.

3.4.2 Stimuli and Predictions

The phrase “He had a rabbit and a” (1.3 sec) was synthesized from the four VT variations (Long, Short, O+P-, and O-P+). The targets consisted of the 12-step “bit” to “bet” series (250 msec each) synthesized from the StandardVT. Based on the LTAS in the region of the first formant, it was predicted that the ShortVT would lead to more “bit” responses than the LongVT (Figure 8). The LTAS comparison did not predict any shift between the O+P-VT and the O-P+VT.
The LTAS of each talker has a peak above the F1 of the ambiguous target, so their effects on perception would be predicted to be similar (Figure 11).

Figure 11. LTAS comparison between the carrier phrases “He had a rabbit and a” of the O+P-VT and O-P+VT and the ambiguous target from the “bit” to “bet” series (bVt5) in the region surrounding F1 (LTAS = O+P-, O-P+; LPC = bVt5).

3.4.3 Participants and Procedure

Sixteen subjects participated in this study. The carrier phrases + targets were separated into blocks. The Long and Short VT carrier phrases + targets were in Block 1. The O+P- and O-P+ VT carrier phrases + targets were in Block 2. There were five repetitions of each stimulus per block and participants went through each block twice (order counterbalanced). This totaled 480 trials (1 phrase x 4 talkers x 12 targets x 10 repetitions).

3.4.4 Results and Discussion

Results supported the LTAS predictions. Listeners perceived “bit” significantly more often following the ShortVT’s carrier phrase than when the targets followed the LongVT’s carrier phrase (t(15) = -5.42, p<.001) and there was no significant difference between the O+P- and O-P+ VTs (t(15) = .69, p>.05). Figures 12 and 13 show these results.

Figure 12. Results for Preliminary Experiment 2: Mean percent “bit” responses for the LongVT and ShortVT synthesized versions of “He had a rabbit and a.” Error bars indicate standard error of the mean.

Figure 13. Results for Preliminary Experiment 2: Mean percent “bit” responses for the O-P+VT and O+P-VT synthesized versions of “He had a rabbit and a.” Error bars indicate standard error of the mean.

These results support the LTAS account by demonstrating the power of this model to predict which talker differences will result in shifts in target perception and which talker changes will not. It is important to note that this is not the first instance where a change in talker has not led to a change in target perception.
Dechovitz (1977b) ran a similar study using productions from two different male talkers and did not find a significant difference in target perception between the two carrier phrases. Darwin, McKeown, and Kirby (1989) also ran a series of studies investigating talker normalization and were not always successful in finding an effect of talker. If LTAS comparisons were made with these stimuli, it would be expected that the carrier phrases used would not differ substantially in their average energy, and so no spectral contrast effect would have been predicted. Thus far, the LTAS account is supported.

3.5 Prediction #3: Backward stimuli should cause the same effect as the same stimuli played forward.

3.5.1 Preliminary Experiment 3

According to the simplest LTAS account, a stimulus that differs from the carrier phrase but maintains the same LTAS should result in a similar shift in target responses. Since forward and backward versions of a phrase have equivalent LTAS, they should result in equivalent effects. This prediction was tested using reversed versions of some of the carrier phrases. These phrases would not be intelligible but would have identical LTAS.

3.5.2 Stimuli and Predictions

The carrier phrases were the same as in Preliminary Experiment 2 (“He had a rabbit and a”). However, only the Long and Short VT were used, since their LTAS comparison showed an effect in the previous experiment. The same 12-step target series was used (“bit”-“bet”).

3.5.3 Participants and Procedure

Sixteen subjects participated in this study. The stimuli were split into two blocks (Block 1 = forward stimuli; Block 2 = backward stimuli). There were five repetitions of each stimulus per block and participants went through each block twice (order counterbalanced) for a total of 480 trials (1 phrase x 2 talkers x 2 directions x 12 targets x 10 repetitions).

3.5.4 Results and Discussion

A 2x2 within-subjects ANOVA was used to analyze the data.
The factors were vocal tract (LongVT, ShortVT) and direction (forward, backward). There was not a main effect of vocal tract (F(1,15) = 3.00, p>.05). There was a main effect of direction (F(1,15) = 17.09, p<.05) and a significant interaction (F(1,15) = 8.73, p<.05). Planned comparisons showed that listeners heard “bit” significantly more often following the ShortVT than following the LongVT (F(1,15) = 5.17, p<.05) when the phrases were played forward, but there was no significant difference when the phrases were played backward (F(1,15) = .58, p>.05). Results are shown in Figures 14 and 15.

Figure 14. Results for Preliminary Experiment 3: Mean percent “bit” responses for the LongVT and ShortVT synthesized forward versions of “He had a rabbit and a.” Error bars indicate standard error of the mean. Significance is marked for the planned comparison.

Figure 15. Results for Preliminary Experiment 3: Mean percent “bit” responses for the LongVT and ShortVT synthesized backward versions of “He had a rabbit and a.” Error bars indicate standard error of the mean.

The significant interaction means that the effect size of phrase was not equal for the forward and backward presentations. The forward effect was larger than the backward effect. These results conflict with the LTAS theory, as it predicted that a phrase played backward should have the same effect as that phrase played forward. One possible explanation is that there is a temporal window over which the LTAS is computed. That is, only the acoustic information within that window of time is used in the calculation of the LTAS. The information prior to that has no effect on target perception. This window may be shorter than the duration of the carrier phrases used.
When the stimuli were played backward, the local spectral information (that is, the average spectral energy closest to the target) may have been different from when they were played forward, causing different effects in the backward and forward conditions. It could also be the case that there is a temporal window but that the window is weighted, meaning the spectral information closer to the target has a greater representation in the LTAS than the spectral information more distant in time. Either one of these would explain this discrepancy in the results without discounting the LTAS theory. A third possibility is that carrier phrases without phonetic content do not elicit talker normalization effects. Whereas the first two possibilities suggest a change in how one calculates the LTAS for the current model, the last possibility would require a substantial reformulation, as the model is considered to be a general perceptual process and not specific to speech.

3.6 Prediction #4: The LTAS that is calculated is an extraction of the neutral vowel; therefore, a talker’s neutral vowel should have the same effect on perception as a phrase produced by that talker.

3.6.1 Preliminary Experiment 4

If targets are perceived relative to the LTAS and this LTAS is reflective of the neutral vowel, then a carrier phrase and the neutral vowel from the same talker should have the same effect on the target series. However, if listeners do not extract the neutral vowel from the speech precursor, then the neutral vowel and the carrier phrase may cause different effects or shifts in perception (if an LTAS comparison would predict one).

3.6.2 Stimuli and Predictions

Both of the carrier phrases in this block were created from the ShortVT. One was simply the neutral vowel (400 msec) and the other was the phrase “Please say what this word is.” This carrier phrase was thought to be long enough, and to include enough of the vowel space, that a listener could extract the neutral vowel from it.
The 10-step /dɑ/-/gɑ/ series served as the targets. The LTAS comparison of the two ShortVT carrier phrases and the ambiguous target is shown in Figure 16. An additional component to this experiment was a second condition in which the carrier phrases were played backward. If the LTAS is what is important and it is an extraction of the neutral vowel, it should not matter if the stimuli are played forward or backward. Listeners should be able to extract the neutral spectral representation from either and use this representation as the referent for their phonemic judgments.

Figure 16. LTAS comparison between the carrier phrases “Please say what this word is” and the neutral vowel of only the ShortVT, along with the ambiguous target from the “da” to “ga” series (dg5) in the region surrounding F3. (LTAS = Please; LPC = Neut, dg5).

3.6.3 Participants and Procedure

Seventeen subjects participated in this study. The stimuli were split into two blocks based on the direction that the carrier phrase was played (forward or backward). Participants went through each block twice (order counterbalanced) and there were 5 repetitions of each stimulus in each block. This made 400 trials total (2 phrases x 1 talker x 2 directions x 10 targets x 10 repetitions).

3.6.4 Results and Discussion

A 2x2 within-subjects ANOVA was used to analyze the data. The two factors were phrase (“Please…,” neutral vowel) and direction (forward, backward). The main effect of phrase was significant (F(1,16) = 6.21, p<.05), but the main effect of direction was not (F(1,16) = .46, p>.05). The interaction was not significant (F(1,16) = .90, p>.05). Planned comparisons showed that there was no significant difference between the two phrases when they were played forward (F(1,16) = 3.48, p>.05), but there was a significant difference when they were played backward (F(1,16) = 7.18, p<.05). In the backward condition, the “Please…” phrase resulted in significantly more “da” responses. See Figures 17 and 18.
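Section 3.2.5 noted that paired-sample t-tests were used when only two carrier conditions were compared. A minimal sketch of that analysis style is shown below; the per-subject response percentages are fabricated for illustration, and the actual analyses were not run in Python:

```python
import numpy as np
from scipy import stats

# fabricated per-subject mean percent "da" responses, collapsed across the
# series, for two carrier-phrase conditions (illustration only)
cond_1 = np.array([62.0, 58.0, 65.0, 60.0, 70.0, 63.0, 59.0, 66.0])
cond_2 = np.array([48.0, 51.0, 50.0, 44.0, 57.0, 49.0, 46.0, 53.0])

# paired-samples t-test: each subject contributes one value per condition
t_val, p_val = stats.ttest_rel(cond_1, cond_2)
print(f"t({len(cond_1) - 1}) = {t_val:.2f}, p = {p_val:.4f}")
```

The pairing matters: because every subject hears both carrier conditions, the test operates on within-subject differences, which is why the dissertation’s designs report t(N-1) with N subjects.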
The lack of a significant interaction means that the effect of phrase was statistically equivalent for the forward and backward presentations. This supports Prediction #3 tested previously: that forward and backward stimuli have the same effect on target perception. Preliminary Experiment 3 did not find this. As suggested, there could be a temporal window over which the LTAS is calculated. If that is the case, it would be expected that the acoustic information closest to the target for both the neutral vowel and the "Please…" phrase is such that it would cause a shift in target perception whether it was played forward or backward. For the stimuli from Preliminary Experiment 3, however, the acoustic information in this temporal window would not predict the same effects, but different ones, when the stimuli were played forward versus backward.

Figure 17. Results for Preliminary Experiment 4: Mean percent "da" responses for the ShortVT synthesized forward versions of the neutral vowel and "Please say what this word is." Error bars indicate standard error of the mean.

Figure 18. Results for Preliminary Experiment 4: Mean percent "da" responses for the ShortVT synthesized backward versions of the neutral vowel and "Please say what this word is." Error bars indicate standard error of the mean. Significance is marked for the planned comparison.

While Prediction #3 was supported, Prediction #4 (that the neutral vowel would elicit the same shift in target perception as a phrase produced by the same talker) was not. A main effect of phrase was found, with simple effects that revealed a significant difference between the phrases when they were played backward. This suggests that listeners may not be extracting the neutral vowel from a talker's speech and then using it as a referent on which to base their other perceptual judgments. This prediction will be investigated further.
3.7 General Discussion

The results of the preliminary experiments generally supported the LTAS model as an explanation for effects of talker normalization. Changes in talker were capable of shifting target perception (Prediction #1), even for consonant contrasts. The experiments also showed that talkers created with the TubeTalker synthesizer were capable of eliciting talker effects and so are acceptable stimuli for this paradigm. The LTAS account also predicts that two carrier phrases by two different speakers may not lead to differences in target perception if their LTAS do not differentially contrast with the target spectrum (Prediction #2). The results of Preliminary Experiment 2 supported this prediction: a substantial change in vocal tract (O+P- vs. O-P+) did not lead to a shift in target perception. These results indicate that LTAS comparisons between the carrier phrases and the ambiguous target member of a series are capable of predicting when a change of talker will and will not affect listeners' subsequent phoneme identification. Additionally, the LTAS account presumes that it is the LTAS of the carrier phrase, not its phonemic content, that is important for target categorization. This means that the same stimulus played backward or forward should have the same effect, since the LTAS of each would be equivalent (Prediction #3). To test this, forward and backward stimuli were used in Preliminary Experiments 3 and 4. The results of Preliminary Experiment 3 did not support this prediction, while the results of Preliminary Experiment 4 did. This led to the possibility that there is a temporal window (and/or weighting mechanism) over which the LTAS is calculated, such that the information closest to the target would have the most impact on that target's perception. Investigating the possibility of a temporal window is the first specific aim of this work (Chapter 4).
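The premise behind Prediction #3, that a stimulus and its time-reversed copy share one LTAS, follows directly from the mathematics of the spectrum: time reversal of a real signal leaves its magnitude spectrum unchanged. A minimal sketch in Python (the windowing parameters are illustrative, not those of any analysis in this work):

```python
import numpy as np

def ltas(signal, win=512, hop=256):
    """Long-term average spectrum: mean power spectrum over
    half-overlapping Hann-windowed frames (a minimal sketch)."""
    w = np.hanning(win)
    frames = [signal[i:i + win] * w
              for i in range(0, len(signal) - win + 1, hop)]
    return np.mean([np.abs(np.fft.rfft(f)) ** 2 for f in frames], axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal(8192)      # noise stand-in for a carrier phrase

# Time reversal leaves the magnitude spectrum of a real signal unchanged,
# so a "backward" carrier has the same LTAS as the forward one.
assert np.allclose(np.abs(np.fft.rfft(x)), np.abs(np.fft.rfft(x[::-1])))
assert np.allclose(ltas(x), ltas(x[::-1]))
```

This is why, under an unweighted whole-phrase LTAS, forward and backward carriers are predicted to behave identically, and why the failure of that prediction in Preliminary Experiment 3 points at the integration window rather than at the spectrum itself.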
Another prediction of the LTAS account is that the LTAS of a carrier phrase is an extraction of the neutral vowel, and so both should have the same effect on target perception (Prediction #4). The results of the forward condition of Preliminary Experiment 4 supported this prediction: there was no difference in target perception between the carrier phrase and the neutral vowel. This prediction must be developed further through analyses of real productions to see if the LTAS of different phrases does resemble the neutral vowel. This is the second specific aim of the research (Chapter 5). Generally, it would seem that the LTAS model is successful at predicting when carrier phrases will lead to perceptual shifts and when they will not. However, it must be noted that these predictions have been coarse and based solely on visual comparisons between the LTAS of different carrier phrases and the ambiguous target. The specific circumstances that lead to a spectral contrast effect are unknown. It could be that peaks in energy only lead to an effect when they are within a certain frequency range of one another. Or it is possible that the peaks are not as important as the troughs, or the lack of energy in certain frequency regions. These questions need to be explored. Also, it is not clear whether the backward effects (or lack thereof) have to do with temporal window estimation or are evidence against the LTAS framework. In order to make quantitative predictions with specificity and to determine exactly what the LTAS model can account for, it is necessary to have a better model of what the LTAS representation is for the listener. Therefore, I propose three specific aims based on the first three hypotheses of the LTAS Model:

Specific Aim #1: Determine the temporal window over which the LTAS is calculated.
Hypothesis #1: During carrier phrases, listeners compute a long-term average spectrum (LTAS) for the talker.
Specific Aim #2: Determine whether or not the LTAS reflects the neutral vocal tract.
Hypothesis #2: The LTAS resembles the output of the neutral VT; that is, the average spectrum resembles the frequency response of the mean VT shape.
Specific Aim #3: Determine what aspects of the comparison of the LTAS and the target predict the shift in target categorization.
Hypothesis #3: Listeners represent target vowels relative to the LTAS of the carrier.

CHAPTER 4
THE TEMPORAL WINDOW

Specific Aim #1: Determine the temporal window over which the LTAS is calculated.
Hypothesis #1: During carrier phrases, listeners compute a long-term average spectrum (LTAS) for the talker.

4.1 Temporal Window

Preliminary Experiments 3 and 4 used the same stimulus sets played forward and backward to determine whether listeners use the LTAS to make phonemic judgments on ambiguous speech sounds, irrespective of the phonemic content of the carrier phrase. It was therefore predicted that the same shift in target perception would be found regardless of whether the carrier phrase was played forward or backward. This prediction was not borne out in Preliminary Experiment 3, suggesting either that the LTAS account is fundamentally wrong or that the LTAS used for the predictions is wrong because it presumes an unweighted integration window that includes the entire carrier phrase. It may be that the temporal window over which the LTAS is calculated is smaller than the phrase, or is weighted so that the information closest to the target is more important than the preceding spectral information. If this is the case, the backward and forward conditions are not equivalent from the perspective of the LTAS, and so differences between the two conditions are likely. To develop a viable quantitative model of LTAS normalization in speech (and in other complex auditory perception), this information on the temporal window is necessary.
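The weighted-window alternative raised here can be made concrete. The sketch below assumes a hypothetical exponential recency weighting with time constant tau; neither the weighting form nor the parameter values come from the experiments, they simply illustrate how "information closest to the target is more important" could enter the LTAS computation:

```python
import numpy as np

def weighted_ltas(signal, fs, tau=0.2, win=256, hop=128):
    """LTAS with an exponential recency weight: frames nearer the end of
    the carrier (i.e., closer in time to the target) count more.
    tau (sec) is a hypothetical time constant, not a value estimated here."""
    w = np.hanning(win)
    starts = np.arange(0, len(signal) - win + 1, hop)
    spectra = np.array([np.abs(np.fft.rfft(signal[s:s + win] * w)) ** 2
                        for s in starts])
    t_back = (len(signal) - (starts + win / 2)) / fs   # sec before the target
    weights = np.exp(-t_back / tau)
    return (weights / weights.sum()) @ spectra

fs = 8000
tone = lambda f, dur: np.sin(2 * np.pi * f * np.arange(int(fs * dur)) / fs)
x = np.concatenate([tone(200, 0.4), tone(1000, 0.1)])  # high tone local to the target
peak_hz = np.argmax(weighted_ltas(x, fs, tau=0.05)) * fs / 256
```

With a short tau, the weighted LTAS of this two-tone sequence is dominated by the late 1000-Hz segment; as tau grows, the computation approaches the unweighted whole-signal LTAS. A hard window of D msec is the limiting case of weights that are 1 within D of the target and 0 before.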
The following set of experiments was designed to test for the possibility of a finite temporal window and to provide a rough estimate of its duration if it exists.

4.2 Experiment 1

The purpose of this experiment is to investigate whether listeners only make use of LTAS information from a specific time window, in which case only information close to the target would have an effect on that target's perception. This experiment used the same stimuli as Preliminary Experiment 3 (the carrier phrase "He had a rabbit and a" synthesized from both the LongVT and ShortVT, forward and backward). Listeners were not only exposed to the full carrier phrase (replicating Preliminary Experiment 3) but also heard shorter versions of the phrases. These shorter versions were the 100-msec and 325-msec portions that were closest to the target in the full-phrase conditions. Note that these shorter durations do not have matching acoustic information: since the phrases were played forward and backward, the backward conditions actually contained the acoustic information from the beginning of the phrase (see Figure 19). Because this information local to the target differed, any temporal window for LTAS computation that was shorter than the overall phrase length could result in differential effects. In particular, the speech closest to the target in the backward condition was the low-amplitude aspiration of the /h/ in "he." If the temporal window is short (like the 100-msec backward stimulus used here), then it is unlikely that any differences in this low-amplitude sound caused by vocal tract length would be large enough to affect target identification.

4.2.1 Methods

4.2.1.1 Stimuli and Predictions

Carrier Phrases. The four carrier phrases from Preliminary Experiment 3 were used in this experiment (the ShortVT and LongVT, backward and forward versions of "He had a rabbit and a") to create three conditions: full phrase, 325-msec phrase, and 100-msec phrase.
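The excerpting just described reduces, on a sampled waveform, to a slice of the final samples. The sketch below (sampling rate assumed for illustration; the actual rate of the stimuli is not specified here) also makes explicit the asymmetry noted above: reversing the whole phrase first puts the beginning of the utterance next to the target, whereas reversing the excerpt itself (as in Experiment 2) does not:

```python
import numpy as np

fs = 22050                                       # assumed sampling rate
phrase = np.arange(int(1.3 * fs), dtype=float)   # stand-in for the 1.3-s carrier

def last_portion(signal, dur_ms):
    """Excerpt the final dur_ms milliseconds of a waveform."""
    return signal[-int(round(fs * dur_ms / 1000)):]

short, medium = last_portion(phrase, 100), last_portion(phrase, 325)

# Experiment 1: the phrase is reversed first, so the excerpt local to the
# target comes from the beginning of the utterance...
exp1_backward = last_portion(phrase[::-1], 100)
# ...whereas in Experiment 2 the same excerpt is reversed, so the forward
# and backward conditions share one LTAS.
exp2_backward = last_portion(phrase, 100)[::-1]
assert not np.array_equal(exp1_backward, exp2_backward)
```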
The full condition contained the full carrier phrases (1.3 sec) in their backward and forward forms. This condition was a replication of Preliminary Experiment 3. The two shorter-duration conditions contained stimuli made from both the backward and forward full-phrase conditions. In Adobe Audition (Balo, et al., 1992), the last 100-msec and last 325-msec portions of each phrase were digitally spliced from the original phrase to make the new carrier phrases. This created twelve carrier phrases overall (2 VTs x 3 durations x 2 directions). Figure 19 shows a spectrogram and waveform of the forward full LongVT carrier phrase, indicating where the stimulus was cut in order to create the shorter segments. Again, note that the backward and forward 100-msec and 325-msec phrases do not contain equivalent spectral information (i.e., the 100-msec backward phrase will have different spectral information than the 100-msec forward phrase), since the end of the backward stimuli contains the acoustics from the beginning of the phrase.

Targets. The "bit" to "bet" target series was used for this experiment. The series was shortened to eleven steps (the last stimulus in the series, bVt12, was cut) to ensure that the experiment would not take longer than one hour. The targets were appended to each of the twelve carrier phrases after a 50-msec inter-stimulus interval. This created a total of 132 stimuli (12 carrier phrases x 11 targets).

Figure 19. Waveform and spectrogram of the LongVT's production of "He had a rabbit and a." The dashed vertical lines show the word boundaries. The horizontal lines show the segments that were used as carrier phrases in either the forward or backward condition for Experiment 1.

Predictions. Figure 20 shows the LTAS comparisons for all of the phrase pairs from which these predictions were derived. As mentioned earlier, one of the purposes of this research is to develop a model that can make more specific predictions about LTAS effects.
That being said, these predictions are still rather coarse. In the LTAS comparisons of Figure 20, the region surrounding the target's first formant (0-1200 Hz) is shown. The ambiguous target (the fifth member of the "bit"-"bet" series) is represented by a solid line, the phrases from the LongVT by dotted lines, and those from the ShortVT by dashed lines. If the LTAS of a carrier phrase has both a trough (region of low amplitude) near the frequency of the target's F1 and a gradual incline that peaks above or below the frequency of the target's F1, the carrier phrase is predicted to have a contrastive effect on target perception. In the case of this ambiguous target, a perceived lowering of its F1 (due to a carrier phrase having a peak higher in frequency) would result in more "bit" responses. A perceived raising of its F1 (due to a carrier phrase having a peak lower in frequency) would result in more "bet" responses. From the LTAS comparisons, then, it would be predicted that both the forward and the backward full-phrase conditions would result in significantly more "bit" responses for the ShortVT (Figure 20, panel A). However, since the full-phrase condition is a replication of Preliminary Experiment 3, it is expected that the backward effect will not be significant. This null effect led to the idea that there may be a temporal window over which the LTAS is calculated and that the local information (the part of the carrier phrase closest to the target) might be responsible for the effect. The LTAS comparisons for these shorter durations are shown in Figure 20, panels B-E. For both of the short forward conditions (325 msec and 100 msec), the same effect as for the full phrase would be predicted: more "bit" responses for the ShortVT. The ShortVT LTAS for the backward 325-msec condition is somewhat atypical, as there is not a trough but a small plateau at the frequencies where the peak of the target's F1 is.
It is unclear whether or not this would have an effect on the target's perception. The LTAS comparison for the backward 100-msec condition does not predict an effect, as the spectral energy for both of the carrier phrases is distant from the target's F1 (note that the y-axis shows a greater range of relative amplitude than the other four images).

Figure 20. LTAS comparisons for the stimuli in Experiment 1 and Experiment 2, between the different durations of the carrier phrase "He had a rabbit and a" of the LongVT and ShortVT and the ambiguous target from the "bit" to "bet" series (bVt5) in the region surrounding F1. A-E are relevant for Experiment 1; A-C are relevant for Experiment 2. The comparisons are as follows: A) the full phrase; B) forward 325-msec segments; C) forward 100-msec segments; D) backward 325-msec segments; and E) backward 100-msec segments. Note that the axes are equivalent, except in the case of comparison E: these 100-msec segments were much lower in amplitude, so the amplitude range was adjusted to include both the 100-msec carriers and the target. (Only the full phrases in A use LTAS; the others are LPC.)

If one of these durations (either 325 msec or 100 msec) is the time window over which the LTAS is calculated, then the effects should follow the same pattern as the forward and backward full-phrase conditions. (If Preliminary Experiment 3 is replicated, this means that the forward condition will show an effect and the backward condition will not.) Expanding the window to include more of the carrier phrase should also not change the pattern of results. For example, if 100 msec appears to be the temporal window, then the 325-msec stimuli should have the same effects on target perception. This is because only the last 100-msec portion of the 325-msec stimulus would be used in the calculation of the LTAS, and this would match the acoustic information of the 100-msec stimuli played alone.
Thus, the same effect would be expected in both cases.

4.2.1.2 Participants and Procedure

Thirteen participants took part in this study. The stimuli were split into three blocks based on duration (Block 1 = all 100-msec stimuli, Block 2 = all 325-msec stimuli, Block 3 = all full-length stimuli) so that participants could take breaks, if needed. Participants were run in each block twice (order counterbalanced), and each block presented four repetitions of each stimulus, for a total of 1,056 trials (1 phrase x 2 talkers x 2 directions x 3 durations x 11 targets x 8 repetitions).

4.2.2 Results

For the forward full-phrase condition, the ShortVT resulted in significantly more "bit" responses than the LongVT (t(12) = -2.81, p<.05), but no significant difference was found for the backward condition (t(12) = -.35, p>.05). The 325-msec condition showed a significant difference for both the forward (t(12) = -4.63, p<.005) and backward (t(12) = 3.63, p<.005) stimuli, with significantly more "bit" responses for the ShortVT in both. The forward 100-msec condition also resulted in significantly more "bit" responses following the ShortVT (t(12) = -3.19, p<.005), but there was no difference between the two vocal tracts for the 100-msec backward carrier phrases (t(12) = -1.01, p>.05). Figure 21 shows these results graphically.

4.2.3 Discussion

Replicating the findings of Preliminary Experiment 3, there was a significant difference between the Long and Short VT when the full phrase was played forward but not backward. For the 100-msec condition, the pattern was the same: the forward effect was significant while the backward effect was not. On the other hand, in the 325-msec condition, both the forward and backward effects were significant. Neither of these durations appears to approximate the temporal window over which the LTAS is calculated; if one of them had, then the stimuli with longer durations should have resulted in the same pattern of effects.
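The per-condition comparisons reported above are paired (within-subjects) t-tests with 12 degrees of freedom across the 13 listeners. A sketch of the form of that analysis, using synthetic per-listener percent-"bit" scores that are made up for illustration and are not the dissertation's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic percent-"bit" scores for 13 listeners (illustrative only).
long_vt = rng.normal(45, 8, size=13)
short_vt = long_vt + rng.normal(8, 4, size=13)   # built-in ShortVT shift

t, p = stats.ttest_rel(long_vt, short_vt)        # paired, df = 12
print(f"t(12) = {t:.2f}, p = {p:.4f}")
```

With the LongVT scores listed first and the ShortVT scores higher, the t statistic comes out negative, mirroring the sign convention of the reported t(12) values.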
There are certain limitations of these stimuli and of the conclusions that can be drawn from them, though. Specifically, one drawback of the 100-msec backward stimulus was mentioned previously: it consisted of the /h/ of "he." To listeners, this sounded like a very short noise burst and did not contain strong spectral energy (as seen in the LTAS comparison, Figure 20, panel E). It seems, then, that such a short stimulus, containing very weak spectral information, would not change substantially as a function of talker and, therefore, would not cause a shift in target perception. Experiment 2 addresses this issue by using a voiced vowel for both the forward and backward 100-msec stimulus.

Figure 21. Results for Experiment 1: Mean percent "bit" responses for the full phrase and shorter, spliced versions of "He had a rabbit and a." Error bars indicate standard error of the mean.

More generally, though, the results do follow the predictions made on the basis of the LTAS comparisons. Specifically, when the ShortVT carrier phrases had a trough in the target's F1 region that gradually increased in amplitude and peaked higher in frequency than the F1, the carrier phrase resulted in more "bit" responses. The only LTAS comparison that did not have a clear prediction was the backward 325-msec condition. The ShortVT did not have a trough in the region of the target's F1, but rather a plateau right in the region of the peak (see Figure 20) that then peaked higher in frequency. The results showed that this plateau was enough to elicit a shift in perception, as this condition also resulted in significantly more "bit" responses for the ShortVT. Future predictions based on LTAS comparisons will take this plateau effect into consideration.

4.3 Experiment 2

In Experiment 1, the effects of the information local to the target were compared to each other.
This provided an indication of the effectiveness of the local information in the backward and forward conditions of Preliminary Experiment 3. However, it meant that the LTASs of the conditions were all different. In Experiment 2, the LTAS is held constant by reversing the same portion of the signal in the smaller time windows. Windows of 100 and 325 msec were again used.

4.3.1 Methods

4.3.1.1 Stimuli and Predictions

Carrier Phrases. Figure 22 shows a spectrogram and waveform of the forward full LongVT carrier phrase with marks where the stimulus was cut to create the shorter segments. The stimuli were created in the same way as those in Experiment 1. The only difference was that the backward shorter-duration stimuli were now reversed versions of the forward shorter-duration stimuli.

Targets. The eleven-step "bit" to "bet" target series was used again. The targets were appended to each of the carrier phrases after a 50-msec inter-stimulus interval. This created a total of 132 stimuli (12 carrier phrases x 11 targets).

Predictions. Figure 20 (A-C) shows the LTAS comparisons relevant to this experiment (full-length, forward 325 msec, and forward 100 msec). Since the stimuli are from Experiment 1 and the forward and backward conditions now each contain equivalent spectral information, only one LTAS comparison is needed for each duration. Based solely on the LTAS comparisons, the predictions are the same as in Experiment 1: listeners will hear "bit" more often when the targets follow the ShortVT than when they follow the LongVT. This is true across all conditions, since each LTAS comparison shows that the ShortVT had a trough in the target's F1 region that gradually increased and peaked higher in frequency. The carrier phrase's higher-frequency peak would cause the target's F1 to be perceived as lower in frequency and more "bit"-like.
From the results of the previous experiments, however, it is expected that the full-phrase condition will only result in a significant shift when it is played forward. The 100-msec and 325-msec conditions, however, should result in significant effects when played both forward and backward.

4.3.1.2 Participants and Procedure

Twelve participants took part in this study. The procedure was the same as in Experiment 1. The stimuli were split into three blocks based on duration (Block 1 = all 100-msec stimuli, Block 2 = all 325-msec stimuli, Block 3 = all full-length stimuli) so that participants could take breaks, if needed. Participants were run in each block twice (order counterbalanced), and each block presented four repetitions of each stimulus, for a total of 1,056 trials (1 phrase x 2 talkers x 2 directions x 3 durations x 11 targets x 8 repetitions).

Figure 22. Waveform and spectrogram of the LongVT's production of "He had a rabbit and a." The vertical dashed lines show the word boundaries. The horizontal lines show the segments that were used as carrier phrases in either the forward or backward condition for Experiment 2.

4.3.2 Results

Again replicating the results for the full-length condition, "bit" was heard significantly more often following the ShortVT than following the LongVT when the stimuli were played forward (t(11) = -3.13, p<.05), but no significant difference was found between the two when the stimuli were played backward (t(11) = -.56, p>.05). For both the 325-msec forward and backward conditions, "bit" was also heard significantly more often following the ShortVT than following the LongVT (forward: t(11) = -5.11, p<.005; backward: t(11) = -2.66, p<.05). Neither of the 100-msec conditions showed a significant difference between the two TubeTalker configurations (forward: t(11) = -1.65, p>.05; backward: t(11) = -.42, p>.05). Figure 23 shows these results graphically.
Since the significant effect found in the 325-msec backward condition appeared to be smaller than that in the forward condition, this was tested with a 2x2 within-subjects ANOVA. The first factor was vocal tract (ShortVT, LongVT) and the second factor was direction (backward, forward). The main effect of vocal tract was significant (F(1,11) = 30.77, p<.005), as was the main effect of direction (F(1,11) = 9.76, p<.05). The interaction was also significant (F(1,11) = 6.18, p<.05). Planned comparisons revealed that significantly more "bit" responses followed the ShortVT than the LongVT in both the forward (F(1,11) = 26.10, p<.005) and the backward condition (F(1,11) = 7.05, p<.05).

Figure 23. Results for Experiment 2: Mean percent "bit" responses for the full phrase and shorter, spliced versions of "He had a rabbit and a." Error bars indicate standard error of the mean. Significance is marked for the t-test analyses.

4.3.3 Discussion

The full-length condition for the "He had a rabbit and a" carrier phrase has proven to be a stable effect (now replicated three times), and the 325-msec effect was found for the forward and backward conditions as expected, but the 100-msec effect was found for neither the forward nor the backward condition. The 325-msec and 100-msec results fit what an LTAS account predicts: the same effect should be found irrespective of whether a stimulus is played forward or backward, because the spectral content of the LTAS is the same. In both of these conditions the forward and backward results matched (both significant for 325 msec, both nonsignificant for 100 msec). However, the results from the 100-msec condition did not replicate the effect from Experiment 1, in which the 100-msec forward condition was significant. It is possible that 100 msec is too brief a stimulus to show a consistent effect across participants.
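Returning briefly to the 2x2 within-subjects analysis reported above: for a fully within-subjects 2x2 design, each effect's F(1, n-1) equals the squared one-sample t on the corresponding contrast of each subject's cell means, so the whole ANOVA can be sketched from difference scores. The data below are synthetic and illustrative only, not the reported results:

```python
import numpy as np
from scipy import stats

def rm_anova_2x2(cells):
    """2x2 fully within-subjects ANOVA from an (n, 2, 2) array of cell
    scores. Each effect's F(1, n-1) is the squared one-sample t on the
    corresponding per-subject contrast."""
    def F(contrast):
        t, p = stats.ttest_1samp(contrast, 0.0)
        return t ** 2, p
    a = cells[:, 1, :].mean(axis=1) - cells[:, 0, :].mean(axis=1)   # factor A
    b = cells[:, :, 1].mean(axis=1) - cells[:, :, 0].mean(axis=1)   # factor B
    ab = (cells[:, 1, 1] - cells[:, 1, 0]) - (cells[:, 0, 1] - cells[:, 0, 0])
    return {"A": F(a), "B": F(b), "AxB": F(ab)}

# Synthetic scores for 12 listeners (axis 1: vocal tract, axis 2: direction).
rng = np.random.default_rng(2)
cells = rng.normal(50, 5, size=(12, 1, 1)) * np.ones((1, 2, 2))
cells[:, 1, :] += 10 + rng.normal(0, 2, size=(12, 2))   # built-in A effect
out = rm_anova_2x2(cells)
```

The built-in vocal-tract shift shows up as a large, significant F for factor A, while factors with no built-in effect hover near chance.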
The LTAS mechanism may need a stimulus of a longer duration to generate a stable LTAS by which comparisons can be made. The results of the 325-msec condition suggest that by 325 msec such a stable LTAS has been calculated, but the interaction showed that the effect was significantly smaller in the backward condition than in the forward condition. This was also found for Preliminary Experiment 3. It may be that stimuli do not elicit strong spectral contrast effects when the listener does not recognize them as speech. This possibility is discussed in the General Discussion. Again, though, the results from the 325-msec condition do not match the pattern found for the full-phrase condition, so neither duration seems to approximate that of the temporal window.

4.4 Experiment 3

The previous experiments provided a paradigm for examining the duration of an integration window: if increasing the window does not change the effect size, then one can presume that the effective integration window has been reached. But it is more difficult to use this paradigm to examine whether the local information is more heavily weighted in the computation of the LTAS. This experiment provides a paradigm for mapping out the weighting of the integration window. Vowels produced by the same talker (predicted to have different effects based on LTAS comparisons) can be appended to form the carrier phrase. If the local information dominates, then the obtained shifts should be predictable from the vowel closest in time to the target, whereas if the window uses a weighting function, shifts based on the local vowel should be offset by the addition of the other vowel(s) to the computation of the LTAS.

4.4.1 Methods

4.4.1.1 Stimuli and Predictions

Carrier Phrases. Eight different carrier phrases were created by appending vowels from the LongVT to one another.
The vowels /i/ and /ɑ/ from the LongVT were chosen because LTAS comparisons (see Figure 24) between these vowels and the ambiguous target from the "bit"-"bet" series yielded the most straightforward predictions. The LongVT /ɑ/ has a peak in energy right above the first formant of the ambiguous target. This would contrastively push the effective F1 of the ambiguous target down, leading to more "bit" responses. The /i/ would have the opposite effect: its peak is below the target's F1, so it would contrastively push the effective F1 up, causing more "bet" responses. Since the previous two experiments did not result in a definitive duration for the temporal window, the duration of each vowel was chosen to be 250 msec (in between the 100-msec and 325-msec durations used previously). The TubeTalker was used to synthesize these vowels, and then, in Adobe Audition (Balo, et al., 1992), the level of the vowels was set to the same amplitude as the vowel portion of the ambiguous target (bVt5). The vowels were then appended together with 15 msec of silence in between them. There was one short condition (only one vowel), two medium conditions (Medium1: /i/ appended to /i/, /ɑ/ appended to /ɑ/; Medium2: /i/ appended to /ɑ/, /ɑ/ appended to /i/), and one long condition (/i/-/i/-/ɑ/ and /ɑ/-/ɑ/-/i/). This created the eight different carrier phrases.

Targets. The original twelve-step "bit" to "bet" target series was used again. The targets were appended to each of the carrier phrases after a 50-msec inter-stimulus interval. This created a total of 96 stimuli (8 carrier phrases x 12 targets).

Predictions. If listeners use a weighted window, then the acoustic content closest to the target will have the greatest effect on target perception. However, depending on the acoustic content of the sounds preceding it, the effect size may change. The vowels /i/ and /ɑ/ have opposite predicted effects on the target.
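The carrier construction described above (250-msec vowels, levels matched to the target's vowel, 15-msec silent gaps) can be sketched as follows. Sine tones stand in for the TubeTalker vowels, and the sampling rate and peak frequencies are assumptions chosen purely for illustration:

```python
import numpy as np

fs = 22050                                   # assumed sampling rate

def tone_vowel(freq, dur_ms=250):
    """Stand-in 'vowel': a sine at a nominal spectral peak.
    (The actual stimuli were TubeTalker vowels; this is illustrative.)"""
    n = int(fs * dur_ms / 1000)
    return np.sin(2 * np.pi * freq * np.arange(n) / fs)

def match_rms(x, target_rms):
    """Scale a signal to a target RMS level, standing in for the level
    matching against the vowel portion of the ambiguous target (bVt5)."""
    return x * (target_rms / np.sqrt(np.mean(x ** 2)))

def sequence(*vowels, gap_ms=15):
    """Concatenate vowels with silent gaps between them."""
    gap = np.zeros(int(fs * gap_ms / 1000))
    parts = []
    for v in vowels:
        parts += [v, gap]
    return np.concatenate(parts[:-1])        # drop the trailing gap

i_v = match_rms(tone_vowel(300), 0.1)    # /i/ stand-in (hypothetical peak)
a_v = match_rms(tone_vowel(800), 0.1)    # /ɑ/ stand-in (hypothetical peak)
long_carrier = sequence(i_v, i_v, a_v)   # the /i/-/i/-/ɑ/ Long condition
```

Note that an unweighted LTAS of sequence(i_v, a_v) and sequence(a_v, i_v) is essentially identical, since averaging over the whole signal discards order; this is exactly why the Medium2 condition distinguishes a weighted from an unweighted window.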
So, for example, if the vowels are played alone before the target, the /i/ should elicit more "bet" responses and the /ɑ/ more "bit" responses. If the vowels are played one after another, the vowel closest to the target should have the most effect, but, because the other vowel does have some impact, it will pull perception in the opposite direction and lessen that effect.

Figure 24. LTAS comparisons for the stimuli in Experiment 3. The images show the vowel sequences of /i/ and /ɑ/ synthesized from the LongVT (LPC used for all comparisons).

The specific predictions for each condition, for both an unweighted and a weighted window, are as follows. Since there is only one vowel in the Short condition, it should follow the LTAS prediction stated above: /ɑ/ should lead to more "bit" responses and /i/ should lead to more "bet" responses. The same should hold for the Medium1 condition (/ii/ vs. /ɑɑ/), since the vowels (and, thus, the spectral content) are the same. For the Medium2 condition (/iɑ/ vs. /ɑi/), the vowel at the end should still have the most impact on the target's perception. However, the other vowel (having the opposite predicted effect from the last vowel) should decrease the overall size of the effect when compared to the Short condition. This should be similar to what would happen in the Long condition (/iiɑ/ vs. /ɑɑi/): the last vowel should have the most effect if the window is weighted, but the previous two vowels should diminish it, since they would pull a listener's perception in the opposite direction. If listeners do not weight the spectral information, then their responses should always be based on the spectral average of the entire stimulus, and the order of the vowels should not matter. The Short and Medium1 conditions should still result in more "bit" responses for the /ɑ/ or /ɑɑ/ carriers and more "bet" responses for the /i/ or /ii/ carriers.
The Medium2 condition should not result in a difference between the two vowel sequences, since /iɑ/ and /ɑi/ have the same spectral average. The Long condition should also result in no shift between the two carriers, since they have very similar spectral content.

4.4.1.2 Participants and Procedure

There were fifteen participants in this study. The stimuli were split into four blocks based on the vowel sequences (Long, Medium1, Medium2, and Short). Participants ran in each block once and there were eight repetitions per block, for a total of 768 trials (1 talker x 8 vowel sequences x 12 targets x 8 repetitions).

Figure 25. Results for Experiment 3: Mean percent "bit" responses for vowel sequences synthesized from the LongVT. Error bars indicate standard error of the mean.

4.4.2 Results

No significant differences were found between any of the paired vowel sequences: Long (/i/-/i/-/ɑ/ and /ɑ/-/ɑ/-/i/): t(14) = 1.28, p>.05; Medium1 (/i/-/i/ and /ɑ/-/ɑ/): t(14) = .24, p>.05; Medium2 (/i/-/ɑ/ and /ɑ/-/i/): t(14) = .63, p>.05; and Short (/i/ and /ɑ/): t(14) = .58, p>.05. Figure 25 displays these results graphically.

4.4.3 Discussion

Although the LTAS comparisons predicted null effects for some conditions, it was surprising that no shifts in perception were found at all. It may be that the stimuli were not of the correct type or did not have the right acoustic characteristics to elicit an effect. Watkins (1991) and Watkins and Makin (1996a; 1996b) have found evidence that certain types of carrier phrases do not elicit strong perceptual effects. In particular, it may be that carrier phrases need more spectro-temporal variation to elicit a spectral contrast effect. This idea will be discussed at length below.

4.5 General Discussion

Initially, the LTAS model predicted that backward stimuli should have the same effect as the same stimuli played forward. This was tested in Preliminary Experiments 3 and 4, and the results were mixed.
While Preliminary Experiment 4 found that effects were the same, Preliminary Experiment 3 did not. These findings led to the suggestion that the LTAS may be calculated over a specific time window; if this is the case, the spectral information closest to the target would affect target perception. This would provide an explanation for the differences between the previous backward and forward conditions, since the local information (the acoustics closest to the target) would change depending on whether the stimulus was played forward or in reverse. Further, it was necessary to discern whether the temporal window was weighted or unweighted – that is, to find out whether the acoustic information closest to the target had the most effect within the temporal window.

The purpose of these three experiments was to determine the temporal window over which the LTAS is calculated and then to determine whether the window had a weighted or unweighted function. This proved to be more complex than originally thought. The first two experiments used the full phrase (“He had a rabbit and a”) and two shorter, spliced versions of that phrase (325 msec and 100 msec) synthesized from the LongVT and ShortVT to try to determine the duration of the temporal window. In Experiment 1, the shorter-duration backward phrases had different acoustic content, and in Experiment 2, they had matching acoustic content. The prediction was that if one of the shorter durations was the temporal window, then extending the window for a longer amount of time would not show a different pattern of effects than the short duration. This was not found for either of these shorter durations, so the temporal window still remains unknown.

The third experiment tried to determine whether the temporal window used a weighting function. The stimuli consisted of sequences of vowels synthesized from the LongVT (sequences varied in length from one vowel to three vowels).
The vowels were determined by LTAS comparisons that predicted that they would have different effects on the target’s perception. Further, the vowel sequences were created so that there would be clear predictions that could tease apart whether the temporal window was weighted or not. Unfortunately, no effects were found across any of the conditions. Based on these three experiments, it is still unclear whether there is a specific temporal window and, if so, whether there is a weighting mechanism attached to it. It is possible that these stimuli did not have the right characteristics or acoustic structure to initiate the LTAS mechanism. There could be some differences between the carrier phrases (the reversed phrases and the vowel sequences) and the target that led to mixed or null effects.

One of the important ideas behind the LTAS model is that it is the result of a general auditory process. Part of this idea came from the work of Holt (2005; 2006), who used a series of tones that had either a high or a low average frequency to shift listeners’ perception of a target. If tones can shift the perception of a target phoneme, then clearly it would be expected that synthesized speech could as well. Even reversed speech would be expected to cause a shift in perception: it may not be intelligible, but it still has spectral properties similar to those of speech. The preliminary experiments showed that, indeed, synthesized speech (specifically from the TubeTalker) could cause shifts in perception. Reversed speech, however, was not as successful. The most consistent finding from the experiments so far is that the full phrase “He had a rabbit and a” does not show an effect of talker when it is played backward, even though it does when it is played forward. The vowel sequences were not successful, either.
There are two possible explanations for why these stimuli may not have caused an effect: (1) the LTAS mechanism does not calculate a spectral average for all sounds but only for sounds with certain properties; or (2) the LTAS mechanism calculates a spectral average for all sounds but, depending on whether or not a sound is identified as being similar to the target, it may or may not affect the target’s perception.

The idea that a sound’s identification may separate it from the target comes from work on stream segregation (Bregman, 1990). In the case of the reversed speech, even though it is unintelligible, it is still well formed and has spectro-temporal cues that could bind it together, so that the listener perceives it as a separate “object” and segregates it from the target. This might explain why the full “He had a rabbit and a” phrase, when played to listeners in reverse, has no effect on target perception: it is bound together and separated from the target, and so has no perceptual effect on it.

Watkins (1991) ran a series of studies to investigate spectral contrast effects and the types of carrier phrases that can elicit them, which included playing stimuli in reverse. Unlike the results of Experiments 1 and 2, he found an effect of reversed speech on vowel perception (an “itch” to “etch” series), which suggests that it may not be segregated from the target (and there were effects found for the 325-msec stimuli in Experiments 1 and 2). His carrier phrase was 1 second in length, which is very similar to the duration of the carrier phrase used in Experiments 1 and 2 (1.3 sec), so the discrepancy cannot be due to a difference in durations (one perhaps providing a more or less stable LTAS than the other by giving the mechanism more time to calculate it).
However, he did not find an effect for noise carrier phrases that were filtered to have the same spectral envelope (or LTAS) as his speech carrier phrase, or for repetitions of the same vowel sound. This led Watkins (1991) to suggest that a carrier phrase must be more similar to running speech and have a “time-varying” rather than a “time-stationary” spectral envelope to cause a shift in target perception. This may explain why the vowel sequences did not elicit an effect; however, Watkins’ vowel sequences were four repetitions of the same vowel, whereas the stimuli used in Experiment 3 changed between two vowel sounds (although within each vowel the spectral content was static). Even if this explains the lack of effects for Experiment 3, Watkins’ results still do not clear up why there was no significant result for the backward full-phrase condition. This leaves open the possibility that there is a time window for the LTAS mechanism; Watkins’ carrier phrase could simply have had the right spectral content in that time window to elicit a backward effect. The question of whether the LTAS is calculated over a specific duration remains undetermined and needs to be investigated further.

CHAPTER 5

EXTRACTING THE NEUTRAL VOWEL

Specific Aim #2: Determine whether or not the LTAS reflects the neutral vocal tract.

Hypothesis #2: The LTAS resembles the output of the neutral VT; that is, the average spectrum resembles the frequency response of the mean VT shape.

5.1 Extracting the Neutral Vocal Tract

Story and Titze (1998) and Story (2005a) acquired MRI images of the vocal tract airspace shape of ten sustained vowels for one male talker. The cross-sectional area for each of the vowel shapes was analyzed using principal components analysis (PCA). PCA begins by subtracting the mean (in this case the average cross-sectional area) and then determines components, or “modes,” that capture most of the variance of the changes around the mean.
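The PCA decomposition described above can be sketched in a few lines of Python. The data here are synthetic stand-ins (44 cross-sections and two invented deformation modes, not the MRI-derived area functions), so the sketch only demonstrates the mechanics: subtract the mean area function, take the SVD of the centered data, and read off how much variance the leading modes capture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 10 "vowel" area functions, each sampled at
# 44 cross-sections, built from a mean shape plus two latent modes.
n_vowels, n_sections = 10, 44
x = np.linspace(0, np.pi, n_sections)
mean_shape = 2.0 + np.sin(x)             # plays the role of the neutral tract
mode1, mode2 = np.cos(x), np.sin(2 * x)  # two invented deformation modes
coefs = rng.normal(size=(n_vowels, 2))
areas = mean_shape + coefs[:, :1] * mode1 + coefs[:, 1:] * mode2

# PCA: subtract the mean area function (the "neutral" estimate), then find
# the orthogonal modes that capture the variance around it via SVD.
neutral = areas.mean(axis=0)
centered = areas - neutral
_, s, modes = np.linalg.svd(centered, full_matrices=False)
var_explained = s**2 / np.sum(s**2)

print("variance captured by the first two modes:", var_explained[:2].sum())
```

Because the toy data were generated from exactly two modes, the first two components capture essentially all of the variance; with the real area functions, Story’s finding was that two modes were enough to account for over 90%.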
In his analysis, Story found that two modes accounted for over 90% of the variance across the ten vowels. Remarkably, these same two modes account for most of the variance for other talkers as well (Story, 2005b). Thus, the main individual differences in vocal tract shape are captured in the mean shape (the average across different vowel shapes), which Story refers to as the “neutral vocal tract.”

From a talker normalization standpoint, the finding that the neutral vocal tract shape is the main difference between talkers provides a possible answer to how listeners can accommodate the variance between speakers. If listeners could estimate the neutral as a referent, then they could “neutralize” its effect on subsequent productions. One possibility is that the average spectral energy (LTAS) that a listener calculates when he/she hears someone speak actually reflects the output of a neutral vocal tract. That is, the average of the acoustic output of an individual over time may be similar to the spectral content of his/her neutral vowel. If a listener can extract the neutral vowel from a talker’s output, the listener may be able to use it as a referent on which he/she can base future phonemic judgments.

There are three specific predictions that can be derived from this hypothesis: (1) If the neutral vowel is extracted through the additive spectral content of a specific talker’s speech productions, then the LTAS of running speech should resemble the spectrum of the neutral vowel. (2) The more the talker moves around the articulatory space, the more closely the LTAS should come to resemble the neutral vowel. (3) If the neutral vowel is extracted, then it should lead to the same quantitative shift in target responses as any phrase from that talker. These predictions are examined in this series of experiments.
5.2 Experiment 4

This experiment tests the first prediction: if the neutral vowel is extracted through the additive spectral content of a specific talker’s speech productions, then the LTAS of running speech should resemble the spectrum of the neutral vowel. The “running speech” for this study consisted of productions from four of the talkers (two males and two females) from Story (2005b). These talkers were used in particular because the shapes of their neutral vocal tracts had been derived from analyses of the MRI images. A synthesized version of each talker’s neutral vowel was created from these known neutral shapes and compared to the average spectral content of the running speech. It was predicted that the spectral average of the running speech should resemble the spectral average of the neutral vowel. Further, the addition of more running speech should provide more spectral information for that talker and, thus, improve the resemblance between the spectra of the running speech and the neutral vowel.

5.2.1 Methods

5.2.1.1 Signals and Predictions

Signals. The running speech consisted of recordings of 11 sentences from four speakers (2 females, 2 males). These four talkers were SF1, SF2, SM1, and SM2 from Story (2005b) (referred to as F1, F2, M1, and M2 in this paper; F = female, M = male). These recordings came from one of the practice sessions that the participants went through before MRI images were obtained from them. The practice sessions were meant to simulate what it would be like in the scanner, so the participants were lying down and wore earplugs during the recordings. The sentences were not picked specifically for this project but for the author’s purposes at that time. While these sentences are not phonetically balanced, they do contain a wide range of vowels, including the point vowels (/i/, /ɑ/, /u/). A list of the sentences can be seen in Table 2.
To create the longest segment of running speech, all eleven sentences were appended to one another in a random order for each talker (the order is also listed in Table 2).

 #  Sentence                                                F1  F2  M1  M2
 1  I need to catch the eight o’clock flight to New York.    1   5   3   7
 2  She danced like a swan, tall and graceful.               7   7   4   5
 3  You were away.                                           5   1  11   3
 4  That noise problem grows more annoying each day.         9   2   2   2
 5  We were in Iowa while you were away in Ohio.             4   4   1   8
 6  The heat in the desert can be unbearable.                6   6  10   1
 7  They all know what I said.                               8   3   5   6
 8  Things in a row provide a sense of order.               10  11   7  10
 9  I’ll make sense of the problem in a moment.             11   8   6   4
10  We were away a year ago.                                 3   9   9  11
11  If I had that much cash, I’d buy the house.              2  10   8   9

Table 2. A list of the sentences recorded from the four participants and the order in which the sentences were appended for each talker to make the stimuli.

To test whether the addition of more speech samples from a particular talker makes the LTAS more similar to the neutral, the appended sentences were spliced to make stimuli that went from shorter to longer durations (the longest duration being all 11 sentences). The sentences had a total of 88 words, and the shorter-to-longer stimuli were created by splicing after every 8 words. These segments were created for all of the speakers up to the full duration of the entire stimulus. The segments used in the analyses were only the first, third, sixth, and eleventh (referred to as Sent1, Sent3, Sent6, and Sent11). Table 3 shows the cumulative duration in seconds for the segments of each talker.

        Sent1  Sent3  Sent6  Sent11
F1       2.04   6.43  14.20   27.42
F2       2.11   6.50  14.07   25.05
M1       2.81   7.57  14.74   26.23
M2       1.59   6.53  13.61   25.58
Avg.     2.14   6.78  14.16   26.07

Table 3. The duration of each stimulus used in the LTAS comparisons with the neutral vowel from each talker.
The neutral vowels were synthesized with the TubeTalker using the same area functions as each talker’s neutral vocal tract shape, with the source fundamental frequency set to 210 Hz for the female talkers and 105 Hz for the male talkers. Each neutral vowel was one second in length.

5.2.1.2 Procedure

For each speaker, the LPC of the neutral vowel was compared to the LPC of the different segments, which ranged from only eight words to all 88 words. Figures 26–29 show the comparisons. Correlational analyses were done on each comparison to see if the spectra of the neutral and the appended sentences had a similar pattern of amplitude changes across frequencies, and to see if the two were more highly correlated when the appended sentences were of a longer duration. The neutral-vowel stimuli contained spectral information up to 5,000 Hz, so the analyses were calculated on a window from 0–4,000 Hz to avoid the drop in spectral energy that started in the 4,000–5,000 Hz region. Also, since the neutral vowel was best represented by the LPC spectrum (its duration was less than 1 second), the numbers used for the comparisons were extracted from the LPC of both the neutral and the different sentence lengths for consistency (sampled at 44,100 Hz).

5.2.2 Results

The results of the correlational analyses are found in Table 4.

Neutral vs.   Female1  Female2  Male1  Male2
Sent1           .23      .68     .53    .31
Sent3           .62      .59     .62    .36
Sent6           .61      .58     .60    .34
Sent11          .68      .56     .62    .33

Table 4. Results of the correlational analyses between the different sentence lengths of the stimuli and the neutral vowel for each speaker.

Figure 26. LPC comparisons between Female1’s neutral vowel and various sentence segments.

Figure 27. LPC comparisons between Female2’s neutral vowel and various sentence segments.

Figure 28. LPC comparisons between Male1’s neutral vowel and various sentence segments.

Figure 29. LPC comparisons between Male2’s neutral vowel and various sentence segments.
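The mechanics of the correlational analysis can be illustrated with a short sketch. The curves below are toy spectra on a shared frequency grid, not the LPC envelopes from the experiment; the fragment only shows the two steps involved: restricting the comparison to the 0–4,000 Hz window and computing a Pearson correlation over the amplitudes in that band.

```python
import numpy as np

def band_correlation(spec_a, spec_b, freqs, lo=0.0, hi=4000.0):
    """Pearson correlation between two spectral envelopes, restricted to a
    frequency band (0-4,000 Hz here, mirroring the analysis window chosen
    to avoid the drop in energy above 4 kHz)."""
    band = (freqs >= lo) & (freqs <= hi)
    a = np.asarray(spec_a, dtype=float)[band]
    b = np.asarray(spec_b, dtype=float)[band]
    return np.corrcoef(a, b)[0, 1]

# Toy envelopes on a common frequency grid (illustrative values only).
freqs = np.linspace(0, 5000, 501)
neutral = np.cos(2 * np.pi * freqs / 2000)            # peaks every 2 kHz
similar = neutral + 0.1 * np.sin(freqs / 300)         # same pattern, small wobble
shifted = np.cos(2 * np.pi * (freqs - 1000) / 2000)   # peaks displaced by 1 kHz

print(band_correlation(neutral, similar, freqs))  # strongly positive
print(band_correlation(neutral, shifted, freqs))  # strongly negative
```

A high value indicates that peaks and valleys line up across the band, which is all the measure registers; it says nothing about absolute level.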
5.2.3 Discussion

The prediction was that the longer the duration of the appended sentences, the more closely correlated their spectrum would be with that of the neutral vowel; the correlations should grow stronger for each longer segment. However, the pattern of results across the talkers does not support this prediction. Female1 follows the expected pattern most closely: her weakest correlation is for Sent1 and her strongest is for Sent11, although her Sent3 and Sent6 values are almost the same. Female2 shows the exact opposite pattern from what was expected: her correlations get weaker as the duration of the stimulus gets longer. Male1’s weakest correlation is for Sent1, but his correlations plateau after that, and Male2 has the weakest correlations with the neutral of all the talkers; his correlations are also the most similar across segment durations.

The weakness of these correlations does not provide a strong case for the prediction that the neutral vowel is extracted from running speech. However, while these correlations are weak, it does appear that the frequencies of the peak locations for the neutrals and the running speech line up well in some cases, especially for Male1. It may be that correlations are not a sensitive enough measure to determine how similar or dissimilar the neutral vowel and running speech are to each other. Also, these analyses included all of the speech sounds and pauses. The speech sounds that create noise (like /s/ and /ʃ/) can lessen the prominence of the peaks, and the silent pauses would lower the average of the LTAS. It may be that these types of sounds and silences should be taken out before comparing the running speech to the neutral vowel.

5.3 Experiment 5

Experiment 4 tested whether or not running speech could provide an estimate of the neutral vowel.
The stimulus set used in Experiment 4, though, was not phonetically balanced and so may not have provided a large enough or equally distributed sample of the possible productions from a talker. This experiment uses more controlled stimuli synthesized from the LongVT and ShortVT to test Prediction #2: the more the talker moves around the articulatory space, the more closely the LTAS should come to resemble the neutral vowel. From this, one could predict that the best case for determining the neutral vowel would be a sampling of the entire vowel space. To get this sampling, the stimuli used in this experiment were vowel trajectories synthesized from the Long and Short VT. The vowel trajectories moved from /i/ to /ɑ/ to /u/ and were compared to the spectral averages of the neutrals synthesized from the same TubeTalkers. To determine whether the entire vowel space is necessary for listeners to extract the neutral vowel, LTAS comparisons were made not only between the full trajectory and the neutral vowel but also at shorter time segments. If the entire sampling is necessary, then the LTAS comparisons should show that the longer the segment (i.e., the more vowels sampled), the closer the match between the LTAS of the neutral and of the vowel trajectory.

There are two different ways in which the LTAS of the neutral and of the vowel trajectory could be similar. The first is that the two spectra could have a similar overall pattern of amplitudes across frequencies (corresponding peaks and valleys). To assess this, correlations between the spectra of the neutral and the vowel trajectory were computed to provide an index of whether this overall pattern was similar. The second possibility is that the amount of spectral energy in the two is similar. To investigate this, RMS differences were calculated to provide an overall “distance” measure between the two spectra.
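These two measures are complementary, which a tiny sketch makes concrete (the values below are invented dB envelopes over the F1 region, purely for illustration): correlation is blind to a uniform level offset between two spectra, while the RMS difference reports that offset directly.

```python
import numpy as np

def rms_difference(spec_a, spec_b):
    """Root-mean-square difference between two spectra (e.g., in dB): an
    overall 'distance' in level, complementing correlation, which only
    tracks the shape of the peaks and valleys."""
    a = np.asarray(spec_a, dtype=float)
    b = np.asarray(spec_b, dtype=float)
    return np.sqrt(np.mean((a - b) ** 2))

# Two envelopes with an identical shape but a constant 6 dB offset.
f = np.linspace(0, 1200, 121)             # the F1 region used in the analyses
base = 10.0 * np.cos(2 * np.pi * f / 800)
offset = base + 6.0

print(np.corrcoef(base, offset)[0, 1])    # ~1: the pattern is identical
print(rms_difference(base, offset))       # 6.0: the constant level offset
```

Two spectra can therefore correlate perfectly yet sit at very different levels, which is why both measures were computed.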
If a full sampling of the vowel space is necessary for the average to resemble the neutral vowel, then the more vowels available, the higher the correlation and/or the lower the RMS difference should be.

5.3.1 Methods

5.3.1.1 Signals and Predictions

Signals. The signals used in the LTAS comparisons were the neutral vowel and vowel trajectory synthesized from the LongVT and ShortVT. The neutral vowels were 1.1 sec in length (matching the duration of the vowel trajectories). The vowel trajectories were created to move from the vowel /i/ to /ɑ/ to /u/. Essentially, the vowel trajectory was the acoustic output of the TubeTalker constantly changing shape through the different mode combinations required to make the vowel sounds in the trajectory. The trajectories of the Long and Short VT were spliced at different time intervals from the beginning of the stimulus to create four different trajectories for comparison (see Figure 30). These time intervals were 100 msec, 300 msec, and 600 msec. Including the full vowel trajectory (1.1 sec), this created four stimuli from each of the LongVT and ShortVT for comparison with their respective neutral vowels.

Figure 30. The vowel trajectory synthesized from the LongVT. The lines mark the time points where the stimulus was spliced to create the stimuli for comparison to the LTAS of the neutral vowel.

Predictions. If the average of a talker’s acoustic output resembles that talker’s neutral vowel, then the more samples of his/her articulations, the higher the possibility that the average will reflect the neutral vowel. In particular, the more of the vowel space that is sampled from the TubeTalker, the higher the correlation and/or the lower the RMS value should be. That is, the full vowel trajectory should be more similar to the LTAS of the neutral than the shorter segments are, since it will be the average of the vocal tract shape moving through all of its possible articulations in the vowel space.
Further, the spectrum of a talker’s vowel sampling should be more likely to reflect that talker’s own neutral than that of another talker. So, the spectrum of the LongVT’s vowel trajectory should be more similar to its own neutral than to the neutral of the ShortVT, and vice versa.

5.3.1.2 Procedure

Due to the short durations of the stimuli, LPC analyses of the stimuli were used in the calculations. Comparisons were made between each of the four vowel trajectory segments and the neutral vowel for that TubeTalker (either the LongVT or ShortVT). Comparisons were also made between the full vowel trajectory (1.1 sec) and the neutral of the opposite vocal tract. The comparisons consisted of correlation analyses and RMS difference calculations. All of the analyses focused on the first formant region of the stimuli (0–1200 Hz) to see if the first spectral peak lined up for both the vowel trajectory and the neutral. This region was chosen since it contained the frequency region considered important for spectral contrast effects with the “bit/bet” series. If listeners use the LTAS as a referent for their perceptual judgments and it reflects the neutral vowel, then the peak in this area should be similar. See Figures 31, 32, and 33 for graphical representations of the LPC comparisons.

5.3.2 Results

The correlation values and the RMS differences between the neutral and each vowel trajectory segment are shown in Table 5.

LongVT Neutral vs.   100 ms  300 ms  600 ms  1.1 sec  Short – 1.1 sec
Corr.                 -.84    -.17    .824    .80      .89
RMS                  22.18   14.09    6.36    5.51     4.41

ShortVT Neutral vs.  100 ms  300 ms  600 ms  1.1 sec  Long – 1.1 sec
Corr.                 -.84    .163    .97     .87      .58
RMS                  16.05   10.04    3.29    6.64     8.12

Table 5. The correlations and the RMS differences between the neutral and each vowel trajectory segment for each vocal tract (LongVT and ShortVT).
5.3.3 Discussion

The main prediction was that the spectral average of the acoustic output of a wide range of articulations would be similar to the spectral average of the neutral vowel. To test this, vowel trajectories that went from /i/ to /ɑ/ to /u/ were synthesized from the Long and ShortVT. They were then spliced at different time points (100 msec, 300 msec, 600 msec, and the full 1.1 sec) to create four stimuli for each vocal tract. Each of these was compared to the neutral vowel, and it was predicted that the longer stimuli would be more similar to the neutral than the shorter stimuli, since they would have more vowel content. It was also predicted that a speaker’s vowel trajectory should be more closely matched to its own neutral vowel than to another speaker’s neutral vowel.

Two analyses were done to determine how closely the vowel trajectories resembled the neutral vowel. The first was a correlation analysis to determine whether the pattern of amplitude change across frequencies (troughs and peaks) was similar between the two.

Figure 31. LPC comparisons of the neutral vowel and the vowel trajectory segments synthesized from the LongVT (Vtraj = vowel trajectory; Vtraj1 = 100 ms; Vtraj3 = 300 ms; Vtraj6 = 600 ms; Vtraj11 = 1.1 sec).

Figure 32. LPC comparisons of the neutral vowel and the vowel trajectory segments synthesized from the ShortVT (Vtraj = vowel trajectory; Vtraj1 = 100 ms; Vtraj3 = 300 ms; Vtraj6 = 600 ms; Vtraj11 = 1.1 sec).

Figure 33. LPC comparisons of the neutral vowel of one TubeTalker compared to the full vowel trajectory of the other TubeTalker (Vtraj = vowel trajectory; Vtraj11 = 1.1 sec).

The LongVT and ShortVT showed the same pattern of results: the correlations became more positive as the duration of the vowel trajectory increased, up to 600 msec. For both vocal tracts, the neutral vowel was highly correlated with the 600-msec portion of the vowel trajectory.
This correlation fell slightly for the full vowel trajectory. The same pattern was found in the RMS differences between the two spectra for the ShortVT: the difference gets smaller as the vowel trajectory gets longer up to 600 msec, but then gets larger for the full vowel trajectory. On the other hand, the RMS difference for the LongVT continually gets smaller and is smallest for the full, 1.1-sec vowel trajectory.

From these synthesized stimuli, it does not look like a sampling of the vowel space is reflective of the neutral vowel. While the two spectra become more similar in both the pattern of spectral changes (correlation) and the difference between them (RMS) up to 600 msec, the additional 500 msec of acoustic information actually moves the spectra of the neutral and the vowel trajectory away from one another in all but one condition (LongVT – RMS), and even in that condition the RMS value improves by less than a point. Further, the analysis that compared the LongVT neutral to the ShortVT full vowel trajectory showed that the LongVT neutral was more highly correlated with, and had a smaller RMS difference from, this trajectory than from any of its own synthesized vowels. This goes against the prediction that the average acoustic output of one talker is similar to that talker’s own neutral vowel. The complementary comparison, between the ShortVT neutral and the LongVT full vowel trajectory, did not show the same reversal: the ShortVT neutral was more highly correlated with its own full vowel trajectory than with the LongVT’s.

Since the mean vocal tract shape is the average of the vowel vocal tract shapes, it was thought that the average of the acoustics of a speaker’s vowels would resemble the output of the mean vocal tract shape – the neutral vowel. One explanation for these results is that the vowel trajectories continuously changed formant values across a broad range of formants.
This would flatten the spectral average, making the spectral peaks less prominent. These analyses also run into the same problems as the previous experiment. It may be that correlations and RMS values are not sensitive enough to determine how closely the two spectra are matched. Additionally, only the region surrounding the first formant was analyzed; the results may have been different had a wider frequency range been investigated.

5.4 Experiment 6

If a sampling of the entire vowel space can give listeners an estimate of the neutral vocal tract, then one would expect that the vowel trajectory and the neutral vowel from the same speaker would result in the same shift in perception of a target series (Prediction #3). This experiment uses the neutral vowels and the full vowel trajectories from Experiment 5 in a perceptual study that tests this prediction.

5.4.1 Methods

5.4.1.1 Stimuli and Predictions

Carrier Phrases. The neutral vowel stimuli and the full vowel trajectories for the Long and Short VT from Experiment 5 were used as the carrier phrases for this experiment.

Targets. An eight-step “bet” to “bat” target series from the StandardVT was used for this experiment. It was created in the same way as the “bit” to “bet” series (equal changes in articulatory steps). The LTAS comparisons did not predict strong shifts for these carrier phrases with the “bit” to “bet” series, but comparisons between the carrier phrases and the “bet” to “bat” series did predict a shift (see Figure 34). The targets were appended to each of the carrier phrases after a 50-msec inter-stimulus interval. This created a total of 32 stimuli (2 talkers x 2 carrier phrases x 8 targets).

Predictions. If listeners are extracting the neutral from the vowel trajectories, then the same shift should be found between talkers (ShortVT and LongVT), irrespective of whether listeners hear the vowel trajectory or the neutral vowel as the carrier phrase.
If listeners do not extract the neutral vowel from the trajectory, then the results should be strictly dependent on the LTAS comparisons. The LTAS comparisons (Figure 34) show that there should be no difference in target perception between talkers when the vowel trajectories are used as the carrier phrases. However, when the neutral vowels are the carrier phrases, there should be significantly more “bet” responses for the ShortVT (“bet” has a lower F1 than “bat,” and the ShortVT’s peak is higher in frequency, which would make the target’s F1 seem lower and more “bet”-like).

Figure 34. LPC comparisons of the ShortVT, LongVT, and ambiguous target with both the synthesized neutral vowel (A) and the vowel trajectory (B).

5.4.1.2 Participants and Procedure

Fourteen participants were run in this study. The stimuli were split into two blocks, separated by carrier phrase (Block 1 = Neutrals, Block 2 = Vowel Trajectories). There were five repetitions per block, and participants ran in each block twice (order counterbalanced). This made a total of 320 trials (2 phrases x 2 talkers x 8 targets x 10 repetitions).

5.4.2 Results

The comparison between the neutral vowels for the LongVT and ShortVT showed a significant difference between the two, with the ShortVT eliciting more “bet” responses (t(13) = -4.62, p<.005). No significant difference was found between the LongVT and ShortVT for the vowel trajectory carriers (t(13) = .35, p>.05). Figure 35 shows the results graphically.

Figure 35. Results for Experiment 6: Mean percent “bet” responses for neutral vowels and vowel trajectories synthesized from the ShortVT and LongVT. Error bars indicate standard error of the mean.

5.4.3 Discussion

The results of the perceptual study do not support the hypothesis that listeners are extracting an average that, under the LTAS paradigm, reflects the neutral vowel and using it as a referent from which they make perceptual judgments.
There was no significant difference between the talkers when the vowel trajectories were used as carrier phrases, but there was a significant difference between them when the neutral vowels were used as carrier phrases. Listeners’ identification of a sound seems to be based directly on the LTAS itself rather than on the extrapolation of a neutral vowel. It is not essential to the model that the average of the spectral energy reflect the neutral vowel, and this experiment suggests that listeners may be using the average alone. However, since the vowel trajectories cover a broad range of formant values (as mentioned previously), it could also be that these stimuli did not have spectral peaks prominent enough to elicit an effect of talker.

Interestingly, there was a shift between the neutral vowels of the LongVT and ShortVT, and these neutral vowels were static in nature. Earlier, it was suggested that stimuli that do not vary spectrally or temporally may not elicit spectral contrast effects; these results run counter to that claim. These stimuli were longer than the vowels used earlier (1.1 sec vs. 400 msec in Preliminary Experiment 4 and 250 msec in Experiment 3). It could be that listeners need more than brief exposure to a sound to calculate its spectral average and then use it as a referent. This needs to be tested in future experiments.

5.5 General Discussion

The aim of these three experiments was to determine whether or not the LTAS of a talker reflects the output of that talker’s neutral vowel. This idea led to three specific predictions: (1) If the neutral vowel is extracted through the additive spectral content of a specific talker’s speech productions, then the LTAS of running speech should resemble the spectrum of the neutral vowel. (2) The more the talker moves around the articulatory space, the more closely the LTAS should come to resemble the neutral vowel.
(3) If the neutral vowel is extracted, then it should lead to the same quantitative shift in target responses as the talker’s speech productions.

These predictions were not supported by the results of Experiments 4, 5, and 6.

Experiment 4 tested the first prediction. Real speech productions (sentences appended to one another) were used to investigate whether running speech reflected the neutral vowel. It was expected that the more acoustic information available (the more segments appended together), the more similar the LTAS would be to the neutral vowel. However, the correlations were not very high. The largest increase in acoustic information was from Sent6 to Sent11. Averaged across subjects, this added almost 12 seconds of speech content; yet, this did not consistently improve the correlation between the neutral and the speech segments. Even when the correlation did improve, it was not by much compared to Sent6 (the biggest improvement from Sent6 to Sent11 was for Female1, by .07).

The weak correlations may have arisen because the entire phrase was included in the analysis. That is, “noisy” speech sounds and silent pauses were included in the LTAS calculation, which would have flattened the spectrum. If these sounds and silent gaps were taken out, the spectral peaks may have been more prominent and the correlation between the running speech and the neutral vowel may have been stronger. It also could be that other measures are more sensitive for these types of comparisons and should be used in future investigations.

Experiment 5 used more controlled stimuli that provided a sample of the entire vowel space for two TubeTalkers (Long and Short VT). The comparisons between these “vowel trajectories” and the neutral for each TubeTalker were meant to test Prediction 2: the more the talker moves around the articulatory space, the more likely the LTAS will come to resemble the neutral vowel. The vowel trajectories were compared with the neutral vowel for the vocal tracts.
They were also spliced into segments that went from a short sample of the vowel space to the full vowel trajectory of the entire vowel space. These also did not show a benefit of adding more acoustic information to the LTAS, having the same issues as Experiment 4. These measures may not have been sensitive enough to the changes, and the stimuli used may need to be changed. The vowel trajectories’ formants were constantly transitioning through a wide range of formant values. This flattened the spectrum so that the peaks were not very prominent, which would explain why the correlations were so weak and also why the vowel trajectories did not elicit changes in target perception in Experiment 6.

Experiment 6 tested the third hypothesis: if the neutral vowel is extracted, then it should lead to the same quantitative shift in target responses as the talker’s speech productions. This perceptual study did not support the idea of neutral extraction. If it had, the perceptual differences of the target should have remained constant between talkers, irrespective of whether the target followed the neutral or the vowel trajectory of the same VT. The pattern of results did not follow this prediction. When the target followed the neutral vowels, there were significantly more “bet” responses for the ShortVT. When the target followed the vowel trajectories, there was no significant difference between the two. Again, this is possibly due to the fact that the vowel trajectories had very weak spectral peaks, which would not have caused a contrastive effect. This would be predicted by the LTAS comparisons of the carrier phrases and the ambiguous target. That is, an LTAS account that does not include the extraction of the neutral vowel but is based only on the spectral average itself would have predicted that the neutral vowels would elicit an effect on target perception and the vowel trajectories would not.
On the other hand, listeners could be extracting the neutral vowel or the neutral vocal tract shape, but if they are, it is not through the simple spectral average that the LTAS model suggests. Further investigation is necessary to tease these two ideas apart.

CHAPTER 6

HARMONIC COMPLEX EFFECTS

Specific Aim #3: Determine what changes in the spectral content of the LTAS elicit a perceptual shift on the target.

Hypothesis #3: Listeners represent target vowels relative to the LTAS of the carrier.

6.1 Harmonic Complex Effects

As mentioned previously, the predictions from the LTAS model have been rather coarse. In general, they have been made by looking at the relative frequency position of peaks in the LTAS and comparing them to a particular formant peak frequency in the ambiguous vowel. Specifically, an effect has been predicted when two criteria are met: (1) the spectral energy of the carrier phrase has a trough (low energy) in the region of the formant of interest, and (2) the energy increases gradually from the trough and peaks higher or lower in frequency than the formant of interest. For the most part, these two criteria have been successful in predicting LTAS effects. However, it is necessary to develop a model that can make specific, quantitative predictions that are based on knowledge of the LTAS mechanism, rather than on “eye-balling” the LTAS comparisons. For instance, it is not clear how close in frequency the peak of the carrier phrase and the target’s formant need to be for the carrier phrase to have an effect on the target’s perception. It could be that outside of a certain frequency range of the target’s formant, the carrier no longer has an effect, even if it does have a peak higher or lower in frequency. It is also unknown whether the peak has to have a certain amplitude to affect the target. It could be that the peak needs to be at least equal in amplitude to the target’s formant peak.
Further, the role of troughs, or regions of low amplitude, has not been given much consideration. Up to this point, predictions of shifts in perception have been based solely on peak location. It may be that troughs have an effect as well. To examine these possibilities so that the LTAS model is capable of making more specific predictions, these next experiments parametrically explore how previous spectral energy affects target responses.

6.2 Experiment 7

This first experiment sought to determine how close in frequency a peak needs to be to the formant of a target to have an effect on that target’s perception. Seven harmonic complexes were created. Each complex had a different spectral peak, moving in 100 Hz steps from 300 to 900 Hz. The ambiguous target from the “bit” to “bet” series (bVt5) was used. The peak of the complexes was matched in amplitude to the peak of the first formant of the ambiguous target. It was expected that the harmonic complexes with peaks closest to the first formant would cause a shift in target perception, while the ones that were further away would not. Again, these are all very coarse predictions with no true standard for what is considered “further away” or “closer”. It was hoped that this experiment would start to refine these into truly definable measures.

6.2.1 Methods

6.2.1.1 Stimuli and Predictions

Carrier Phrases. Seven harmonic complexes were created in MATLAB. They were 500 msec in length with a 50 msec amplitude ramp at the beginning and end of the stimulus. Each harmonic was a multiple of 100 Hz and the complex was composed of 12 harmonics (100-1200 Hz) with a single spectral peak. The peak frequency changed from 300 Hz to 900 Hz. The steepness of the peak was determined by measuring the change in amplitude between the harmonics in the first formant (downslope) of the ambiguous target of the “bit”-“bet” series. The peak harmonic was matched in amplitude to the amplitude of the first formant in the ambiguous target.
The harmonics immediately lower and higher in frequency were 10 dB lower, and the rest of the harmonics were 15 dB lower than the highest harmonic. Figure 36 shows the spectra for three of the harmonic complexes (HC300, HC600, HC900) and the ambiguous target.

Figure 36. The spectra of three of the harmonic complexes and the ambiguous target.

Targets. The “bit”-“bet” series (11 steps) was appended to the end of the carrier phrases after a 50-msec ISI (77 stimuli total).

Predictions. If the peak in the LTAS needs to be within a certain frequency range of a target’s formant to have an effect on perception, then peaks outside that range should not have an effect. The harmonic complexes were created in an attempt to determine what this range is. The harmonic complexes within this range should shift the target’s perception. In the case of this particular ambiguous target (bVt5), the region around the first formant is the region of interest, since the first formant is important for the distinction between a “bit” and a “bet”. If the harmonic complex peaks higher in frequency, the first formant of the ambiguous target should be perceived as having a lower frequency and, thus, elicit more “bit” responses. If the harmonic complex peaks lower in frequency, then the opposite should happen and it should elicit more “bet” responses. The harmonic complexes outside the range should not have an effect on the target.

6.2.1.2 Participants and Procedure

Twenty participants were run in this experiment. The stimuli were not split up into separate blocks, but played in the same block. Each stimulus was repeated 3 times per block and participants ran in this block four times, for a total of 924 trials (7 harmonic complexes x 11 targets x 12 repetitions).

6.2.2 Results

A one-way ANOVA was used to test for differences in percent “bit” responses at seven different frequency peaks.
The percentage of “bit” responses was determined in the same way as in previous experiments (averaged across repetitions of the five middle members of the target series). The results were not significantly different across the seven different frequencies (F(6,114)=1.41, p>.05). Figure 37 shows the results graphically. Figure 38 shows the individual data, as well as the average for each condition.

Figure 37. Results for Experiment 7: Mean percent “bit” responses for seven different harmonic complexes. Error bars indicate standard error of the mean.

6.2.3 Discussion

Unexpectedly, the ANOVA showed no significant differences between harmonic complexes. Looking at Figure 38, it is apparent that there is a large amount of variation between subjects. One possible reason for this variation could be that these stimuli were sensitive to differences in listeners’ boundaries between the vowels /ɪ/ and /ɛ/. That is, the ambiguous target is not always the same across listeners. For some, it could be an earlier target in the series and, for others, one that is later. That means the same harmonic complex could have the opposite effect on two listeners’ perception if their ambiguous targets were not the same. This may not have affected the results of previous experiments, since there were typically only two different carrier phrases being compared – so only two frequency peaks were of concern when being compared to the target’s first formant. This would have less of a chance of washing out any effects, because only the listeners whose ambiguous target was near one of those peaks may have been affected. In this case, the peaks occurred every 100 Hz, so more discriminable differences between listeners’ ambiguous targets could have caused any effects to disappear. For this reason, two more ANOVAs were run. The average of percent “bit” responses across all conditions was found (grand mean) and the averages for each subject across all conditions were determined.
If a subject’s average was higher than the grand mean, they were put into one group; if it was lower, they were put into another. This was intended to separate the participants by whether they had a lower or higher ambiguous target. These separate ANOVAs did not find a significant effect, either (above grand mean: F(6,42)=1.94, p>.05; below grand mean: F(6,66)=.49, p>.05).

Figure 38. Experiment 7 individual differences for each harmonic complex.

The individual differences seem to be quite complex and cannot simply be accounted for by dividing the participants into two groups. Another possibility is that seven carrier phrases were, in some sense, overwhelming to the participant. Not knowing how the LTAS mechanism works, it could be that stimulating the same frequencies within a certain region continuously for up to an hour “tired” the auditory system. As mentioned previously in Chapter 4, the auditory system could also have segregated the harmonic complexes and identified them as a separate auditory event from the target. If so, then the harmonic complexes may not have been able to affect the target.

6.3 Experiment 8

The purpose of this experiment was to determine if the amplitude of the peak had an effect on target perception. It could be that a higher peak has more of an effect than a smaller peak. Harmonic complexes were again created. This time, there were five peaks, but each peak had two possible amplitude levels. These were played before the target series to see if the harmonic complexes with the higher amplitude had a larger effect on the target’s perception.

6.3.1 Methods

6.3.1.1 Stimuli and Predictions

Carrier phrases. As in Experiment 7, the carrier phrases were harmonic complexes that consisted of 12 harmonics. They were also 500 msec in length with a 50 msec amplitude ramp on the beginning and end of each stimulus. The peak varied in 100 Hz steps from 300 Hz to 700 Hz. There were two different peak amplitudes.
The first amplitude was the same as used in Experiment 7. The center or peak harmonic was matched to the target. The harmonics immediately lower and higher in frequency were set 10dB lower than that, and the rest of the harmonics were 15dB lower than the center harmonic. The second amplitude setting added 15dB to the center harmonic and the harmonics immediately lower and higher in frequency. The rest of the harmonics stayed the same. This created a total of 10 carrier phrases (5 peak frequencies x 2 amplitude levels). Figure 39 shows the spectra of one of the harmonic complexes with both its matched peak and its high-amplitude peak.

Targets. The “bit”-“bet” series was appended to the end of the carrier phrases after a 50-msec ISI (110 stimuli total). Only the first eleven steps of the series were used to ensure that the experiment took no longer than one hour.

Figure 39. Spectra of the ambiguous target and the harmonic complex with a peak at 700 Hz in both the matched amplitude condition (top – HC700) and the high amplitude condition (bottom – HC700-p15).

6.3.1.2 Participants and Procedure

Thirteen participants were run in this experiment. The stimuli were split into blocks based on peak amplitude, so there were two blocks (Block 1 = Matched Peak, Block 2 = High Peak). There were five repetitions per block and participants ran in each block twice, for a total of 1100 trials (5 peaks x 2 amplitudes x 11 targets x 10 repetitions).

6.3.2 Results

A 2x5 (2 amplitudes x 5 peak frequencies) ANOVA was used to analyze the data. The main effects of amplitude and peak frequency were not significant (F(1,12)=.12, p>.05; F(4,48)=16.46, p>.05). No significant interaction was found (F(4,48)=.61, p>.05). Figure 40 shows the results graphically, and Figures 41 and 42 show the individual scores for each subject across conditions.

Figure 40.
Results for Experiment 8: Mean percent “bit” responses for five different harmonic complexes at two different amplitude levels. Error bars indicate standard error of the mean.

Figure 41. Experiment 8 individual differences for the matched peak condition.

Figure 42. Experiment 8 individual differences for the high peak condition.

6.3.3 Discussion

Again, no significant effect was found. Due to the individual variation, the scores were again split by whether each subject’s average across conditions was above or below the grand mean. This did not result in any significant effects, either (above grand mean: F(4,28)=.82, p>.05; below grand mean: F(4,16)=1.18, p>.05). Possible reasons why no effect was found will be discussed in the General Discussion.

6.4 Experiment 9

Experiment 8 investigated whether the amplitude of the peak had an effect on target perception. This experiment investigated whether the amplitude of the trough (a region of low spectral energy) affects target perception.

6.4.1 Methods

6.4.1.1 Stimuli and Predictions

Carrier phrases. These stimuli were created in the same way as those of the previous two experiments (500 msec, 50 msec amplitude ramp). Instead of having peaks, though, these had troughs whose lowest point varied in 100 Hz steps from 300 Hz to 700 Hz. There were two different trough amplitudes. The first amplitude was meant to mirror the matched peak values. The three harmonics that created the trough had a lower value than the rest of the harmonics (the center of the trough was 15dB lower and the harmonics immediately higher and lower in frequency were 10dB lower). The second amplitude setting subtracted 15dB from the center harmonic and the harmonics immediately lower and higher in frequency. The rest of the harmonics stayed the same. This created a total of 10 carrier phrases (5 trough frequencies x 2 amplitude levels). Figure 43 shows the spectra of a subset of the harmonic complexes.

Targets.
The “bit”-“bet” series (11 steps) was appended to the end of the carrier phrases after a 50-msec ISI (110 stimuli total).

Figure 43. Spectra of the ambiguous target and the harmonic complex with a trough at 700 Hz in both the trough (top – HC700-m) and the large trough (bottom – HC700-m15) conditions.

6.4.1.2 Participants and Procedure

Fourteen participants were run in this experiment. The stimuli were split into blocks based on trough amplitude, so there were two blocks (Block 1 = Small Trough, Block 2 = Large Trough). There were five repetitions per block and participants ran in each block twice, for a total of 1100 trials (5 troughs x 2 amplitudes x 11 targets x 10 repetitions).

6.4.2 Results

A 2x5 (2 amplitudes x 5 trough frequencies) ANOVA was used to analyze the data. The main effects of amplitude and trough frequency were not significant (F(1,13)=.19, p>.05; F(4,52)=1.50, p>.05). No significant interaction was found (F(4,52)=2.41, p>.05). Figure 44 shows this graphically. Figures 45 and 46 show the individual scores of all participants across conditions.

Figure 44. Results for Experiment 9: Mean percent “bit” responses for five different harmonic complexes at two different trough amplitude levels. Error bars indicate standard error of the mean.

Figure 45. Experiment 9 individual differences for the small trough condition.

Figure 46. Experiment 9 individual differences for the large trough condition.

6.4.3 Discussion

No effect was found that provided clues as to how troughs may play a role in target perception. The data were again split into two groups, and again, no significant effect was found for either (above grand mean: F(4,32)=1.58, p>.05; below grand mean: F(4,16)=1.08, p>.05). Reasons why these stimuli may not have elicited any effects are discussed in the General Discussion.
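The carriers in Experiments 7-9 all follow one recipe: twelve harmonics of a 100 Hz fundamental with a single peak or trough (neighbors 10 dB away, remaining harmonics 15 dB away), 50-msec onset/offset ramps, and a 500-msec duration. Below is a minimal Python/NumPy sketch of that recipe. It is illustrative only: the original stimuli were generated in MATLAB, and the sampling rate, linear ramp shape, and normalization here are assumptions (in particular, no matching to the target's F1 amplitude is attempted).

```python
import numpy as np

def harmonic_carrier(center_hz, kind="peak", boost_db=0.0,
                     fs=22050, dur=0.5, ramp=0.05, f0=100, n_h=12):
    """Sketch of the Experiment 7-9 carriers.

    kind="peak":   center harmonic 0 dB, neighbors -10 dB, rest -15 dB;
                   boost_db=15 raises center and neighbors (Exp 8).
    kind="trough": center harmonic -15 dB, neighbors -10 dB, rest 0 dB;
                   boost_db=15 deepens center and neighbors (Exp 9).
    """
    t = np.arange(int(fs * dur)) / fs
    sig = np.zeros_like(t)
    for k in range(1, n_h + 1):
        f = k * f0
        if f == center_hz:
            db = (-15.0 - boost_db) if kind == "trough" else boost_db
        elif abs(f - center_hz) == f0:        # immediate neighbors
            db = (-10.0 - boost_db) if kind == "trough" else (-10.0 + boost_db)
        else:
            db = 0.0 if kind == "trough" else -15.0
        sig += 10 ** (db / 20) * np.sin(2 * np.pi * f * t)
    n = int(fs * ramp)                        # 50-msec linear on/off ramps
    env = np.ones_like(sig)
    env[:n] = np.linspace(0.0, 1.0, n)
    env[-n:] = np.linspace(1.0, 0.0, n)
    sig *= env
    return sig / np.max(np.abs(sig))          # normalize to +/-1

hc600 = harmonic_carrier(600)                      # Exp 7: peak at 600 Hz
hc700_p15 = harmonic_carrier(700, boost_db=15.0)   # Exp 8: high-amplitude peak
hc700_m = harmonic_carrier(700, kind="trough")     # Exp 9: trough at 700 Hz
```

In the actual experiments the peak harmonic's absolute level was matched to the first-formant peak of the ambiguous bVt5 target; the sketch captures only the relative dB pattern among harmonics.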
6.5 General Discussion

The specific aim of these experiments was to determine what changes in the spectral content of the LTAS would elicit a perceptual shift in the target. It seemed appropriate to use very simple stimuli (the harmonic complexes) to investigate this, as the peaks and troughs could be easily manipulated relative to the target’s first formant. However, it seems that these stimuli may have been too simple, as no effects were found in any of the three experiments.

As mentioned previously (General Discussion of Chapter 4), it may be that because these precursors have a similar structure, they are grouped separately from the target (and therefore have no effect on its perception). The process whereby listeners determine what acoustic information belongs to which auditory event is known as stream segregation (Bregman, 1990). This was thought to be a potential explanation for why the reversed speech of the full phrase “He had a rabbit and a” did not have an effect on target perception. There is also another possible explanation. Watkins (1991) suggested that carrier phrases must include spectro-temporal variation to elicit a change in target perception. These harmonic complexes did not vary, but maintained the same spectral energy for the 500-msec duration. That could also provide an explanation for the null effects found here. On the other hand, in Chapter 5 – Experiment 6, the neutral vowel elicited a shift in perception as a function of talker, despite having no spectro-temporal variation. It seems that the LTAS mechanism, and what elicits effects from it, is rather complex. This will be discussed further in Chapter 7.

CHAPTER 7

DISCUSSION AND CONCLUSION

7.1 Overview Summary

Listeners have an amazing ability to understand a wide range of talkers, despite all of their inherent differences. Talkers vary in their anatomy and physiology, in their dialect/accent, and even in the speaking styles they may use in different social contexts.
All of these affect the acoustic output of the particular speaker, and for over fifty years, researchers have tried to figure out how listeners seem to handle this variability so easily. The process by which listeners compensate for these differences has been referred to as “talker normalization.” Specifically, this project was concerned with extrinsic talker normalization (see Chapter 1 for a review of the different theories of talker normalization). The paradigm with which these effects have been studied (first used by Ladefoged and Broadbent, 1957) presents listeners with a series of carrier phrases followed by an ambiguous target speech sound. The carrier phrases are usually manipulated to sound as if they are being produced by different talkers. Listeners are asked to identify the target speech sound at the end, and in many cases, when the talker changes, listeners’ perception shifts as well. This is taken as evidence for talker normalization – that listeners’ perception is affected by differences between talkers.

The previous models and theories of extrinsic talker normalization have had one thing in common: the proposal that listeners are learning something specific about the talker during the initial carrier phrase. Whether it is that the listener is creating an acoustic-phonemic map or that the listener is extracting anatomical/physiological information about a talker’s vocal tract and articulatory patterns, these models presume that the listener is extracting fairly specific information about the talker. However, these theories have not always been able to account for the results across all experiments. In particular, a substantial change in talker has not always led to an effect on the target (Darwin, McKeown, and Kirby, 1989; Dehcovitz, 1997b).

The purpose of this project was to develop and test an alternative model of extrinsic talker normalization. This model was motivated by two previous findings.
The first was that listeners’ perception of a target speech sound could be shifted from one phoneme to another, depending on whether a series of preceding tones had a high or low average frequency (Holt, 2005; 2006). If the tone sequence had a higher average than the frequency region important for the target speech sound’s identification (typically, a formant), then the formant would be perceived as lower in frequency and the target would be identified as the speech sound with the lower formant (Lotto & Holt, 2006). The current model extends this spectral contrast effect to try to account for the effects of carrier phrases on target perception. The model suggests that listeners could be calculating an average across the carrier phrase – in this case, a Long Term Average Spectrum or LTAS – and using this as a referent for identification of ambiguous target sounds. If a peak in the LTAS were close to the important frequency region (or formant) of a speech sound, then it would be predicted to have a contrastive effect on that sound’s perception. However, if none of the peaks in the LTAS were close to important frequency regions, the carrier phrase would not have an effect on the target’s perception. By making LTAS comparisons between the carrier phrase and the target, one could predict not only that the carrier would have an effect but also the particular effect it would have.

The second finding that influenced the model was based on analyses of vocal tract area functions derived from MRI images (Story, 2005b). These analyses showed that, across different talkers’ vowel productions, most of the variability between talkers could be captured by their neutral vocal tract shape, or average shape across all of the vowels. Based on this result, it is tempting to speculate that listeners could accommodate individual talker differences by estimating this neutral vocal tract shape.
Such a referent (a cross-sectional area of a vocal tract air space) would only be useful if one presumed a model of talker normalization that was based on an internal vocal tract model (perhaps through Analysis-by-Synthesis). Another possibility is that the acoustic output of a neutral vocal tract – the neutral vowel – would also provide some normalization of talker differences when used as a referent. The final theoretical leap is that the LTAS of a talker’s speech may, in fact, reflect the spectral characteristics of the neutral vowel. This would be because the speech produced is the output of a variety of vocal tract shapes, and the neutral vowel is essentially the output of the average vocal tract shape. It seems plausible that the average of the output of a variety of vocal tract shapes could eventually come to resemble the neutral vowel. The LTAS being calculated and used as a referent, then, could be similar in spectral content to the neutral vowel.

Based on these two findings and their theoretical extensions, an LTAS model of talker normalization was developed, which had four main hypotheses: 1) during carrier phrases, listeners compute an LTAS for the talker; 2) this LTAS resembles the spectrum of the neutral vowel; 3) listeners represent subsequent targets relative to this LTAS referent; and 4) such a representation reduces talker-specific variability.

From the results of three preliminary experiments (Chapter 3), it was noted that the predictions from the LTAS model were not very precise. Predictions were the result of rough comparisons of the LTAS and target spectra. It would be preferable to have a model that could make precise, quantitative predictions. The preliminary studies did not always support predictions of the basic LTAS model. Specifically, the pattern of expected results was not obtained when the carrier phrases were played backward.
This led to the idea that the LTAS may be calculated over a specific time duration, not over an entire phrase. Together with the idea of the neutral vocal tract, this led to three specific aims for the current project:

Specific Aim #1: Determine the temporal window over which the LTAS is calculated. (Chapter 4 – Experiments 1, 2, 3)

Specific Aim #2: Determine whether or not the LTAS reflects the neutral vocal tract. (Chapter 5 – Experiments 4, 5, 6)

Specific Aim #3: Determine what aspects of the comparisons of the LTAS and the target predict shifts in target categorization. (Chapter 6 – Experiments 7, 8, 9)

Below, I give a short summary of the results of the experiments associated with each aim, followed by an interpretation of the results in light of the model. Future experiments are proposed in each section based on questions raised by the results.

7.2 Specific Aim #1: Determine the temporal window over which the LTAS is calculated.

Preliminary Experiment 3 found an effect of talker when stimuli were played forward but not when they were played backward. This was a challenge for the LTAS model, as it would predict that contexts with the same LTAS (like a signal played forward or backward) should have the same effect on listeners’ perception of the target. The lack of a backward effect led to the prediction that there may be a temporal window over which the LTAS is calculated. That is, it may be that only some duration of the acoustic information closest to the target will affect that target’s perception. As the forward and backward contexts have different information near the target (the backward context actually having the acoustics from the beginning of the carrier phrase), this could explain the discrepancy in the results. The purpose of the first two experiments, then, was to determine whether a temporal window could explain the previous results using a gated presentation.
The third experiment used a complementary paradigm (adding information instead of subtracting it) to try to provide an estimate of the duration of such a window. The first two experiments used the same stimuli as Preliminary Experiment 3 (the carrier phrase “He had a rabbit and a” played both forward and backward). These stimuli were spliced to create a stimulus set with different durations (100 msec, 325 msec, and 1100 msec). Carrier phrases containing the same content in the critical temporal window were expected to have the same effect regardless of the content outside the window. The results from these experiments, however, did not show a definitive time window. Different patterns of effects were obtained for each gating duration. The fact that significant effects for backward presentation were obtained for 325-msec durations but not for the full phrase suggests that, if there is a window of integration for LTAS computation, it is longer than 325 msec.

Experiment 3 was designed to examine the effects of context duration on the computation of the LTAS by appending stimuli with well-described spectral characteristics (steady-state vowels). By measuring the effects of appending additional vowels, it was proposed that one could determine the extent of the temporal window and perhaps even the weighting of the information as one moves distally from the target. However, none of the manipulations resulted in changes in target perception.

7.2.1 Interpretation and Future Research

The finding that the full phrase “He had a rabbit and a” results in an effect of talker when played forward but not when played backward was consistent across multiple experiments (Preliminary Experiment 3, Experiment 1, and Experiment 2). This is a challenge to the simplest LTAS-based model, in which the LTAS is computed across the entire context.
Neither of the shorter gating durations examined resulted in a pattern of effects that would indicate a temporal window for LTAS computation. One possible interpretation is that one cannot obtain context effects for reversed speech because it does not contain phonetic information. Although this possibility is discussed further in section 7.5, it is unlikely to be true in its strongest form, since backward effects have been previously demonstrated (Watkins, 1991) and a backward effect was obtained for the 325-msec gating window (and in Preliminary Study 4). Another possibility is that the temporal window over which the LTAS is computed is longer than 325 msec but shorter than the overall phrase duration (1100 msec), and that within this window the LTAS of the forward and backward contexts differ.

Future studies will explore longer temporal windows. However, a new phrase should be used to verify that there is a difference between backward and forward stimuli. Because the phrase used here contains “bit” in the carrier (“rabbit”), it poses a problem for tests of longer temporal windows. For example, splicing the phrase 750 msec from the beginning results in “He had a rab”. This would be a problem when the targets consist of a “bit” to “bet” series: lexical information could skew listeners toward more “bit” responses, as this would create a word. Another phrase should therefore be used to avoid lexical biases and to confirm that there are differences between forward and backward phrases.

The lack of effect of the appended vowel contexts in Experiment 3 provides no insight into the estimation of a temporal window, but it does demonstrate a constraint on the normalization mechanism. It is clear that some contexts elicit no effects on target perception despite having appropriate phonetic and spectral content. Suggestions on what underlies these constraints are developed further in section 7.5.
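A useful fact behind this interpretation is that time reversal leaves a signal's long-term magnitude spectrum unchanged, so any model that computes the LTAS over the whole context must predict identical effects for forward and backward phrases, while a model that computes it over only the final portion of the context need not. The toy Python illustration below makes this concrete (the chirp signal, sampling rate, and crude frame-averaged LTAS are all assumptions for demonstration, not the dissertation's stimuli or analysis):

```python
import numpy as np

def ltas(x, fs, nfft=1024):
    """Crude LTAS: average magnitude spectrum over non-overlapping frames."""
    n = (len(x) // nfft) * nfft
    frames = x[:n].reshape(-1, nfft)
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

fs = 16000
t = np.arange(17 * 1024) / fs          # ~1.1 s, like the full carrier phrase
# A rising chirp stands in for the carrier: its spectral energy moves from
# low to high frequencies over time.
carrier = np.sin(2 * np.pi * (300 + 400 * t) * t)
backward = carrier[::-1]

# Full-phrase LTAS: identical for forward and backward presentation.
full_fwd = ltas(carrier, fs)
full_bwd = ltas(backward, fs)

# LTAS over only the final 325 msec: different content now falls inside
# the window, so the two estimates diverge.
w = int(0.325 * fs)
win_fwd = ltas(carrier[-w:], fs)
win_bwd = ltas(backward[-w:], fs)
```

A temporal-window account thus predicts forward/backward differences precisely where a whole-phrase account cannot.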
7.3 Specific Aim #2: Determine whether or not the LTAS reflects the neutral vocal tract.

The second specific aim was to determine whether or not the LTAS reflects the neutral vowel. Experiment 4 compared the LTAS of running (real) speech to the spectrum of the neutral vowel for four different talkers (two female, two male). It was predicted that the more running speech was included in the calculation, the more the LTAS would resemble the neutral vowel. As this was the first investigation of this proposal, the computations and comparisons were performed with the fewest assumptions possible: the envelope of the LTAS computed across entire sentences was correlated with the envelope of the spectrum of the neutral vowel. The resulting correlations were not convincingly high. One could easily argue that a correlational analysis is a rudimentary way of investigating this idea and that other methods could better describe the relationship between the running speech and the neutral vowel. Also, the correlations compared the frequency region from 0–4000 Hz. If a smaller region had been analyzed (for example, only the region that included the first three formants), the correlations may have been much stronger. Perhaps most importantly, the LTAS was computed across the entire sentence without excluding any portion of the signal. However, the running speech consisted of sentences that included an array of different speech sounds, including fricatives and stops as well as silent gaps. These types of sounds likely flatten the spectrum at lower frequencies near the first two formants. It could be that if these sounds were removed, the LTAS of the running speech would include more distinct resonances that would match the neutral vowel. In Experiment 5, synthesized stimuli from both the LongVT and the ShortVT were used to test whether the acoustic output from a speaker could come to resemble that speaker's neutral vowel.
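Returning to the correlational analysis of Experiment 4: the suggestion that a narrower analysis region might have yielded stronger correlations can be made concrete with a small sketch. Everything here is illustrative, not the actual analysis: `pearson` and `band_correlation` are hypothetical helper names, and the toy spectra below stand in for measured envelopes.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def band_correlation(ltas, neutral, fs, lo_hz, hi_hz):
    """Correlate two magnitude spectra over a restricted frequency band.

    Both spectra are assumed to hold equally spaced bins spanning 0..fs/2 Hz;
    the comparison is limited to [lo_hz, hi_hz].
    """
    hz_per_bin = (fs / 2) / len(ltas)
    lo, hi = int(lo_hz / hz_per_bin), int(hi_hz / hz_per_bin)
    return pearson(ltas[lo:hi], neutral[lo:hi])
```

Two spectra that agree in the F1–F2 region but diverge at higher frequencies will correlate far better over the restricted band than over the full 0–4000 Hz range, which is the pattern the text anticipates.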
The stimulus that was compared to the neutral vowel for each TubeTalker was a vowel trajectory: the output of vocal tract movement from /i/ to /ɑ/ to /u/. It was predicted that including a greater sampling of the vowel space would result in an LTAS that would come to resemble the neutral vowel. The results of the LTAS comparisons between the vowel trajectory stimuli and the neutral vowels did not show that sampling more of the articulatory space brought the output closer to the neutral vowel. One possible reason for the failure to support this prediction is that these signals contained continuous changes in formant values, which result in a more flattened LTAS representation. As a result, the correlations between the LTAS and the neutral vowel were actually better when only half the vowel space was sampled. When both halves were included in the computations, the correlations decreased, in part because averaging across a broader range of formant frequencies results in less distinct spectral peaks, especially in the region of F2. Rather than comparing the LTAS of different speakers, Experiment 6 compared listeners' perception of a target following either the neutral vowel or the full vowel trajectory of either the LongVT or the ShortVT. If listeners are extracting the neutral vowel, then the pattern of effects across talkers should have been the same whether the carrier was the neutral vowel or the vowel trajectory. However, the pattern of results was not the same: the neutral carrier phrase produced a significant effect of talker, but the vowel trajectories did not. Again, this difference in perception could arise because the continuous vowel trajectories do not produce very distinct LTAS patterns.

7.3.1 Interpretation and Future Research

The results of Experiment 4 do not provide strong evidence that the LTAS of running speech resembles the spectrum of the neutral vowel.
However, this study included comparisons with the fewest assumptions about how the LTAS is calculated. In particular, the inclusion of silences (e.g., stop closures) and fricative and aspiration noise in the LTAS likely results in a flattened LTAS in the region of interest for the studies conducted (the region of F1 and F2). An obvious next step is to examine the LTAS for only the voiced sections of the speech signal. This is likely to result in more distinct spectral peaks that may be comparable to the resonances of the neutral vocal tract. It should be noted that such a step has implications for the LTAS model of normalization: an additional mechanism for detecting voicing in the signal would have to be proposed to limit LTAS computation to just those segments. In addition to testing different ways of calculating the LTAS from the signal, the next steps in this project will include the development of new ways to test the match between the LTAS and the neutral vowel. The correlation and RMS methods used in this dissertation are simple and straightforward, but perhaps too coarse. Future measures could include correlations and distances of estimated peak frequencies. Even with novel metrics, it is unlikely that the LTAS and the neutral vowel will be identical for most speech. An interesting question to follow up on is whether the LTAS provides significant normalization if used as a referent. Vowel spaces can be plotted for multiple speakers both in objective F1-F2 space and in a space with the frequencies of the first two major peaks of the LTAS serving as the origin (reference point). If vowel category dispersion is less in the LTAS-referenced space, this would indicate some beneficial normalization even if the neutral vowel itself is not extracted.
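The proposed dispersion comparison can be sketched with a toy example. The formant and peak values below are made up, and `dispersion` and `ltas_referenced` are hypothetical helpers, not part of the analyses reported here.

```python
def dispersion(points):
    """Mean squared distance of 2-D points from their centroid."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n

def ltas_referenced(f1, f2, peak1, peak2):
    """Re-express a token's first two formants relative to the talker's
    first two major LTAS peaks (the proposed reference point)."""
    return (f1 - peak1, f2 - peak2)
```

If talkers' vowel formants shift together with their LTAS peaks, tokens of the same vowel category cluster more tightly in the peak-referenced space than in raw F1-F2 space, which is exactly the reduced-dispersion signature described above.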
The results of Experiments 5 and 6 indicate that continuous excursions through the vowel space may not be the best signal for estimating the neutral vowel, because the averaging tends to flatten the spectral peaks. Future studies will examine how the neutral vowel compares to the LTAS across isolated productions of all of the vowels of English, or just the point vowels. It is likely that these vowel spaces will also result in some flattening of the LTAS, which raises the interesting possibility that a more poorly sampled articulatory space (as occurs in real speech) may provide a more effective LTAS – at least in terms of its effects on target perception. This will be tested in future studies pitting vowel-space carrier phrases against sentences that are phonetically balanced or imbalanced. The LTAS and neutral vowel acoustic comparisons will be paired with perceptual studies. These studies have the potential to further elucidate why carrier phrase effects occur for some contexts and not others.

7.4 Specific Aim #3: Determine what aspects of the comparisons of the LTAS and the target predict shifts in target categorization.

The experiments in Chapter 6 aimed to determine the finer details of the spectral contrast mechanism. Specifically, Experiment 7 investigated how close in frequency the peak of the carrier needed to be to the target's formant to shift perception. Experiment 8 investigated whether the amplitude of the peak mattered, and Experiment 9 examined whether troughs (regions of low amplitude) played a role in perception. It was thought that the best way to investigate these questions was to use very simple context stimuli whose peaks and troughs could be easily manipulated. Harmonic complexes provided the necessary control while maintaining some similarity to speech stimuli. The set of stimuli utilized in these studies consisted of steady-state sounds constructed from 12 harmonics (multiples of 100 Hz) with different amplitudes.
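A steady-state stimulus of this general type can be sketched as below. The amplitude pattern, sampling rate, and duration are illustrative choices, not the values used in Experiments 7-9.

```python
import math

def harmonic_complex(amps, f0=100.0, fs=10000, dur=0.5):
    """Steady-state harmonic complex: harmonic k (1-indexed) has
    frequency k * f0 and amplitude amps[k - 1]."""
    n = int(fs * dur)
    return [sum(a * math.sin(2 * math.pi * (k + 1) * f0 * t / fs)
                for k, a in enumerate(amps))
            for t in range(n)]

# 12 harmonics of 100 Hz with a spectral peak at the 5th harmonic (500 Hz);
# the amplitude values are hypothetical
amps = [0.2, 0.3, 0.5, 0.8, 1.0, 0.8, 0.5, 0.3, 0.2, 0.2, 0.2, 0.2]
```

By choosing the amplitude vector, a peak or trough can be positioned at any harmonic, which is what makes this class of stimuli attractive for probing spectral contrast.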
The peaks and troughs in these patterns were based on the target stimuli spectra and included the range of parameters predicted to most likely result in contrast effects. No shifts in target perception were obtained from any of the conditions across the three experiments.

7.4.1 Interpretation and Future Research

Before completely discounting the use of harmonic complexes as preceding carriers in these experiments, it should be noted that methodological changes could result in more sensitive measures. In Experiments 7-9, the stimuli were randomized within each block. Sjerps, Mitterer, and McQueen (2011a) claim that context effects are more robust when blocking by preceding carrier, rather than mixing the presentation of the stimuli within a block. They suggest that when the stimuli are heard together, the preceding carriers may interact across trials and that this may weaken effects. If this is the case, then the harmonic complexes could potentially elicit effects if they are separated into different blocks and presented repeatedly, instead of randomized. If these methodological changes do not result in significant context effects, then the complete lack of shifts for these stimuli indicates a very strong constraint on the types of context that will elicit effects on target perception. Previous work has demonstrated large effects of non-speech sounds on speech perception, whether sequences of tones (Holt, 2005; 2006) or single frequency-modulated tones (Lotto & Kluender, 1998). These previous stimuli tended to include frequency or amplitude modulation, which may be an important factor for eliciting such effects (as discussed below).

7.5 Additions to the Model Suggested By Current Results

The LTAS model presented here assumed that any stimulus could affect the perception of a subsequent target sound if it contained a peak in spectral energy in a region around an important phonetic cue (e.g., a specific formant).
However, not all contexts resulted in shifts in target perception in the experiments reported here, challenging the assumptions of the simplest version of the model. These unpredicted results provide an opportunity to develop new hypotheses about the mechanisms involved in talker normalization effects. First, consider the reversed-phrase contexts. One of the most consistent effects across the experiments (Preliminary Experiment 3, Experiment 1, and Experiment 2) was that the forward full phrase “He had a rabbit and a” resulted in more “bit” responses for the ShortVT than for the LongVT. However, when the phrase was played backward, these effects were either smaller or absent. One possible constraint on context effects is that the target and context must be considered part of the same auditory object or stream. Bregman (1990) referred to the perceptual process of grouping acoustic characteristics into coherent objects or streams as Auditory Scene Analysis. According to Bregman's formulation, sounds only interact with each other when they are part of the same stream; it is more difficult to determine relations (such as temporal order) when the sounds belong to different streams. This provides a possible explanation for the smaller or absent effect for backward contexts relative to forward contexts. When “He had a rabbit and a” is presented forward, it is easily identified as speech and would belong to the same auditory event as the following target. When it is played backward, it is not readily identified as speech, but it still has a strong acoustic structure that may cause it to be identified as a different auditory event. Importantly, the internal structure of the backward speech would provide a strong cue for the formation of its own auditory object, separate from the target. In this way, the backward context here differs from the tone sequences of Holt (2005), which did not have a strong internal structure.
As a result, these tones may not have formed a strong separate object and could therefore interact with the speech target, resulting in a significant context effect. It should also be noted that the formation of auditory streams takes time (Bregman, 1986), so shorter contexts may not be as strongly segregated perceptually, which may explain the significant effects obtained for the 325-msec backward contexts in Experiments 2 and 3. If future studies provide more evidence of a strong role for perceptual streaming in carrier phrase effects, then it will suggest that the mechanism for talker normalization must follow the processes involved in Auditory Scene Analysis. It should be noted that significant effects have been obtained for reversed contexts in other studies. Watkins (1991) ran a series of experiments, two of which used reversed speech stimuli; both of these experiments found effects for the reversed speech. His carrier phrases were male and female recordings of the phrase “The next word is” (male: 1 sec; female: 1.07 sec). These carriers were around the same length as the reversed carrier used here (1.3 sec). Sjerps, Mitterer, and McQueen (2011b) ran a study comparing the perception of a target series following normal speech, spectrally-rotated speech, and highly-manipulated speech. The carrier phrase in this case was a Dutch phrase (English translation: “On that book, it doesn't say the name”) that was synthesized to sound like different talkers by manipulating the F1 value of the phrase. The exact length was not given, but this phrase is clearly of comparable length to those of Watkins (1991) and the “rabbit” phrase. Similar to reversed speech, the spectrally-rotated speech has strong spectro-temporal characteristics but would no longer be identified as speech. The purpose of the “highly-manipulated” speech was to make it even more unlike normal speech.
This was done by not only spectrally rotating the speech but also flattening the pitch, removing certain amplitude ranges, reversing every syllable, and matching the amplitude across the entire phrase. The spectrally-rotated speech was found to have an effect on target perception, but the highly-manipulated speech did not. Given the number of differences between the contexts and targets used in these studies and the current experiments, it would be difficult to provide a full explanation for the differences in results with any confidence. Future experiments will manipulate the carrier phrases used to examine how some of these differences (for example, having a voiced segment immediately preceding the context) may affect the outcome. The other contexts that did not cause a perceptual shift in target identification were the vowel repetitions of Experiment 3 (examining whether there was a weighted temporal window for the LTAS) and the harmonic complexes of Experiments 7, 8, and 9. Vowel repetitions should not be at risk of being segregated from a target speech sound, as they should be identified as speech. Watkins (1991) also ran an experiment that used vowels. Each vowel sequence consisted of the same vowel (200 msec) repeated four times (the vowel was either /ɪ/, /ɛ/, or the ambiguous vowel in between the two). The vowels used in Experiment 3 were 250 msec in length, and the sequences were some combination of /i/ or /a/, extending from one to three vowels. In this case, Watkins (1991) was unable to find an effect. He suggested that contrastive effects require the carrier to have time-varying spectro-temporal properties, which static, synthesized vowels do not. While this seems plausible, Experiment 6 did find an effect when static vowels were used. The neutral vowels were 1100 msec in length, synthesized from the LongVT and ShortVT (the ShortVT resulted in more “bet” responses). Static vowels, then, are capable of causing shifts in perception.
The harmonic complexes used in Chapter 6 could have both the problem of being segregated into a separate auditory object and of lacking spectro-temporal changes. It is interesting to note, though, that Holt (2005; 2006) and Laing, Liu, Lotto, and Holt (2012) found effects when the carriers they used were series of sine-wave tones with high or low mean frequencies. In Laing et al. (2012), these non-speech carriers consisted of seventeen 70-msec tones with 30-msec ISIs (1700 msec total). These simple tones were able to cause an effect, but the harmonic complexes were not. This is curious, as the harmonic complexes are only slightly more “complex.” Previous work has suggested that spectro-temporal variation is essential for obtaining context effects (Watkins, 1991; Sjerps, Mitterer, & McQueen, 2011b). This cannot be completely true, as context effects can be obtained for the neutral vowels. However, it is possible that frequency or amplitude modulation enhances context effects. The auditory system is very sensitive to modulations, and spectro-temporal variability may result in more robust representations of the context. Future studies will examine the influence of adding frequency and/or amplitude modulations. If it is the case that spectro-temporal variability enhances talker normalization effects, this will have major implications for the LTAS model: it will be necessary to determine whether the modulation enhances LTAS computation, reduces perceptual stream segregation, or enhances the mechanism of contrast.

7.6 Conclusion

The simple LTAS model of extrinsic talker normalization was based on a set of findings in perception and speech production. The first model was based on the most basic assumptions of how the LTAS is calculated, how it relates to the neutral vowel, and how contrast effects are predicted. The results of nine experiments demonstrate that many refinements will have to be made to the basic model to make it viable.
In particular, it is quite possible that LTAS computation occurs over some finite temporal window and only on segments that are voiced. The spectral contrast mechanism that changes target perception relative to the LTAS likely follows perceptual grouping (Auditory Scene Analysis) and may be strongly affected by the spectro-temporal variability of the context. The continued research based on the results of these experiments is likely to provide insight both into general auditory processes for complex sounds and into adaptive processes in speech perception.

APPENDIX A. AMBIGUOUS TARGET SELECTION

The ambiguous targets for the “bit” to “bet” to “bat” series were determined in a brief pilot study. The study consisted of three participants. Each participant listened to the 20-step target series (described in Ch. 3, Section 3.2.2) that changed perceptually from “bit” to “bet” to “bat”. They then verbally reported which member of the series was the ambiguous target and where the target series began and ended. They were allowed to listen to the target series as many times as they needed to make a decision. Since these three participants were all highly familiar with listening to and creating experiments with the carrier phrase + target paradigm, their experience was deemed sufficient to make these judgments. The stimuli were chosen by majority. For example, one participant thought bVt6 was the ambiguous target in the “bit” to “bet” series, but the other two thought it was bVt5, so bVt5 was chosen as the ambiguous target. From these results, the target “bit” to “bet” series consisted of the first 11 stimuli from the series (bVt1-bVt11), with the ambiguous target judged to be the fifth member (bVt5). The “bet” to “bat” series consisted of 8 steps (bVt10-bVt17), with the ambiguous target judged to be the fourth member (bVt13).

APPENDIX B.
FILTERED EXPERIMENTS

To make LTAS predictions with the “bit” to “bet” to “bat” series, it was necessary to determine which formant(s) played a role in the perception of the vowel – that is, which formants were susceptible to spectral contrast effects. In particular, it was tested whether the first formant carried the effect or whether the second formant (and higher) also played a role. To test this, neutral vowels synthesized from the LongVT and ShortVT were presented to listeners either with no filter, with a high-pass filter, or with a low-pass filter. The cutoff for the high-pass and low-pass filters was 1200 Hz. These served as the carrier phrases and were followed by the 8-step “bet” to “bat” series. Eighteen participants took part in this study. Listeners heard 10 repetitions of each stimulus for a total of 480 trials (2 vocal tracts x 1 phrase x 3 filters x 8 targets x 10 repetitions). Results showed that the ShortVT resulted in significantly more “bet” responses (t(17) = -5.95, p < .001), as did the low-pass condition (t(17) = -8.53, p < .001). The high-pass condition showed no significant difference between the two phrases (t(17) = 1.01, p > .05). Since the low-pass filtered stimuli (retaining the first formant) resulted in a significant difference, the first formant appears to be the cue that is important for the “bet/bat” distinction and is susceptible to spectral contrast effects. The higher formants do not seem to play a role, since no effect was found for the high-pass filtered stimuli. Since the “bet/bat” targets were part of the extended series that included “bit” to “bet”, it was assumed that these results would generalize to the other end of the series as well. Also, previous literature has focused on manipulating only the first formant when creating series that use the vowels /ɪ/ and /ɛ/, while keeping the other formants constant (Darwin, McKeown, & Kirby, 1989; Watkins, 1991; Watkins & Makin, 1994; Sjerps, Mitterer, & McQueen, 2011a; 2011b).
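The high-pass/low-pass manipulation can be illustrated with a simple complementary one-pole filter pair. This is a crude stand-in for whatever filters were actually applied: only the 1200-Hz cutoff comes from the study, while the filter design and function names are assumptions introduced here.

```python
import math

def low_pass(signal, fs, cutoff_hz):
    """One-pole (6 dB/octave) low-pass filter."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / fs
    alpha = dt / (rc + dt)
    out, prev = [], 0.0
    for x in signal:
        prev += alpha * (x - prev)   # leaky integration toward the input
        out.append(prev)
    return out

def high_pass(signal, fs, cutoff_hz):
    """Complementary high-pass: the input minus its low-passed version."""
    lp = low_pass(signal, fs, cutoff_hz)
    return [x - l for x, l in zip(signal, lp)]
```

Low-passing a neutral vowel at 1200 Hz retains the F1 region while attenuating F2 and above; high-passing does the reverse, which is the contrast the two filtered conditions were designed to create.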
The literature's focus on F1 thus provides further evidence that the first formant is the important cue for the “bit/bet” distinction. Therefore, the LTAS comparisons used to make predictions focus on the first formant region of the ambiguous target of the series.

REFERENCES

Ainsworth, W. A. (1972). Duration as a cue in the recognition of synthetic vowels. The Journal of the Acoustical Society of America, 51(2B), 648–651.
Ainsworth, W. A. (1974). The influence of precursive sequences on the perception of synthesized vowels. Language and Speech, 17(2), 103–109.
Ainsworth, W. A. (1975). Intrinsic and extrinsic factors in vowel judgments. In G. Fant & M. Tatham (Eds.), Auditory analysis and perception of speech (pp. 105-113). London: Academic Press.
Assmann, P. F., Nearey, T. M., & Hogan, J. T. (1982). Vowel identification: Orthographic, perceptual, and acoustic aspects. The Journal of the Acoustical Society of America, 71(4), 975-989.
Balo, S., et al. (1992-2004). Adobe Audition 1.5 [computer software]. San Jose, CA: Adobe Systems.
Blumstein, S. E., & Stevens, K. N. (1979). Acoustic invariance in speech production: Evidence from measurements of the spectral characteristics of stop consonants. The Journal of the Acoustical Society of America, 66(4), 1001–1017.
Blumstein, S. E., & Stevens, K. N. (1980). Perceptual invariance and onset spectra for stop consonants in different vowel environments. The Journal of the Acoustical Society of America, 67(2), 648–662.
Blumstein, S. E., & Stevens, K. N. (1981). Phonetic features and acoustic invariance in speech. Cognition, 10(1–3), 25–32.
Bradlow, A. R., Kraus, N., & Hayes, E. (2003). Speaking clearly for children with learning disabilities: Sentence perception in noise. Journal of Speech, Language and Hearing Research, 46(1), 80-97.
Bregman, A. S. (1990). Auditory scene analysis: The perceptual organization of sounds. Cambridge, MA: MIT Press.
Broadbent, D. E., & Ladefoged, P. (1960). Vowel judgements and adaptation level.
Proceedings of the Royal Society of London, Series B, Biological Sciences, 151(944), 384-399.
Byrne, D. (1977). The speech spectrum-some aspects of its significance for hearing aid selection and evaluation. British Journal of Audiology, 11(2), 40-46.
Byrne, D., Dillon, H., Tran, K., Arlinger, S., Wilbraham, K., Cox, R., ... & Ludvigsen, C. (1994). An international comparison of long-term average speech spectra. The Journal of the Acoustical Society of America, 96(4), 2108-2120.
Chen, Y. (2008). The acoustic realization of vowels of Shanghai Chinese. Journal of Phonetics, 36(4), 629-648.
Chevreul, M. E. (1839). De la loi du contraste simultané des couleurs [On the law of simultaneous contrast of colors]. Paris: Pitois-Levrault.
Chiba, T., & Kajiyama, M. (1958). The vowel, its nature and structure. Phonetic Society of Japan.
Cleveland, T. F., Sundberg, J., & Stone, R. E. (2001). Long-term-average spectrum characteristics of country singers during speaking and singing. Journal of Voice, 15(1), 54-60.
Clopper, C. G., Pisoni, D. B., & de Jong, K. (2005). Acoustic characteristics of the vowel systems of six regional varieties of American English. The Journal of the Acoustical Society of America, 118(3), 1661–1676.
Cornelisse, L. E., Gagné, J. P., & Seewald, R. C. (1991). Ear level recordings of the long-term average spectrum of speech. Ear and Hearing, 12(1), 47-54.
Darwin, C. J., McKeown, J. D., & Kirby, D. (1989). Perceptual compensation for transmission channel and speaker effects on vowel quality. Speech Communication, 8(3), 221–234.
Dechovitz, D. (1977a). Information conveyed by vowels: A confirmation. Haskins Laboratories Status Report on Speech Research, 213–219.
Dechovitz, D. R. (1977b). Information conveyed by vowels: A negative finding. The Journal of the Acoustical Society of America, 61(S1), S39.
Delattre, P. C., Liberman, A. M., & Cooper, F. S. (1955). Acoustic loci and transitional cues for consonants. The Journal of the Acoustical Society of America, 27(4), 769–773.
Di Lollo, V. (1964).
Contrast effects in the judgment of lifted weights. Journal of Experimental Psychology, 68(4), 383–387.
Dilley, L. C., & Pitt, M. A. (2007). A study of regressive place assimilation in spontaneous speech and its implications for spoken word recognition. The Journal of the Acoustical Society of America, 122, 2340-2353.
Dunn, H. K., & White, S. D. (1940). Statistical measurements on conversational speech. The Journal of the Acoustical Society of America, 11(3), 278-288.
Ernestus, M. (2000). Voice assimilation and segment reduction in casual Dutch: A corpus-based study of the phonology-phonetics interface. Utrecht: LOT.
Esposito, A. (2002). On vowel height and consonantal voicing effects: Data from Italian. Phonetica, 59(4), 197-231.
Fant, G. (1960). Acoustic theory of speech production. The Hague, The Netherlands: Mouton & Co.
Fenn, K. M., Shintel, H., Atkins, A. S., Skipper, J. I., Bond, V. C., & Nusbaum, H. C. (2011). When less is heard than meets the ear: Change deafness in a telephone conversation. The Quarterly Journal of Experimental Psychology, 64(7), 1442–1456.
Ferguson, S. H., & Kewley-Port, D. (2002). Vowel intelligibility in clear and conversational speech for normal-hearing and hearing-impaired listeners. The Journal of the Acoustical Society of America, 112, 259-271.
Fitch, W. T., & Giedd, J. (1999). Morphology and development of the human vocal tract: A study using magnetic resonance imaging. The Journal of the Acoustical Society of America, 106(3), 1511-1522.
Frazer, T. C. (1993). "Heartland" English: Variation and transition in the American Midwest. Tuscaloosa, AL: University of Alabama Press.
Glidden, C. M., & Assmann, P. F. (2004). Effects of visual gender and frequency shifts on vowel category judgments. Acoustics Research Letters Online, 5(4), 132-138.
Goldinger, S. D. (1998). Echoes of echoes? An episodic theory of lexical access. Psychological Review, 105(2), 251-279.
Gottfried, T. L. (1984).
Effects of consonant context on the perception of French vowels. Journal of Phonetics, 12(2), 91-114.
Greenberg, S. (1999). Speaking in shorthand – A syllable-centric perspective for understanding pronunciation variation. Speech Communication, 29(2), 159-176.
Halberstam, B., & Raphael, L. J. (2004). Vowel normalization: The role of fundamental frequency and upper formants. Journal of Phonetics, 32(3), 423–434.
Halle, M., & Stevens, K. (1962). Speech recognition: A model and a program for research. IRE Transactions on Information Theory, 8(2), 155–159.
Helson, H. (1964). Adaptation-level theory. New York: Harper & Row.
Hillenbrand, J. M., Clark, M. J., & Nearey, T. M. (2001). Effects of consonant environment on vowel formant patterns. The Journal of the Acoustical Society of America, 109(2), 748-763.
Hillenbrand, J. M., & Gayvert, R. T. (2005). Open source software for experiment design and control. Journal of Speech, Language, and Hearing Research, 48, 45-60.
Hinton, L., Moonwomon, B., Bremner, S., Luthin, H., Van Clay, M., Lerner, J., & Corcoran, H. (2011). It's not just the valley girls: A study of California English. Proceedings of the annual meeting of the Berkeley Linguistics Society (Vol. 13).
Holt, L. L. (2005). Temporally nonadjacent nonlinguistic sounds affect speech categorization. Psychological Science, 16(4), 305-312.
Holt, L. L. (2006). The mean matters: Effects of statistically defined nonspeech spectral distributions on speech categorization. The Journal of the Acoustical Society of America, 120(5), 2801-2817.
Holt, L. L., Lotto, A. J., & Kluender, K. R. (2000). Neighboring spectral content influences vowel identification. The Journal of the Acoustical Society of America, 108(2), 710-722.
Jacewicz, E., Fox, R. A., & Salmons, J. (2007). Vowel space areas across dialects and gender. Proceedings of the XVIth International Congress of Phonetic Sciences (pp. 1465–1468).
Jacobsen, T., Schröger, E., & Sussman, E. (2004).
Pre-attentive categorization of vowel formant structure in complex tones. Cognitive Brain Research, 20(3), 473-479.
Johnson, K. (2004). Massive reduction in conversational American English. In Spontaneous speech: Data and analysis. Proceedings of the 1st session of the 10th international symposium (pp. 29-54). Tokyo, Japan: The National Institute for Japanese Language.
Joos, M. (1948). Acoustic phonetics. Language, 24(2), 5-136.
Kahane, J. C. (1982). Growth of the human prepubertal and pubertal larynx. Journal of Speech and Hearing Research, 25(3), 446-455.
Katz, W. F., & Assmann, P. F. (2001). Identification of children's and adults' vowels: Intrinsic fundamental frequency, fundamental frequency dynamics, and presence of voicing. Journal of Phonetics, 29(1), 23–51.
Krapp, G. P. (1925). The English language in America. New York: Frederick Ungar.
Kurath, H. (Ed.) (1939). The linguistic atlas of New England. Providence: Brown University Press.
Labov, W., Ash, S., & Boberg, C. (2005). The atlas of North American English: Phonetics, phonology and sound change. Berlin, Germany: Mouton de Gruyter.
Ladefoged, P., & Broadbent, D. E. (1957). Information conveyed by vowels. The Journal of the Acoustical Society of America, 29(1), 98–104.
Laing, E. J., Liu, R., Lotto, A. J., & Holt, L. L. (2012). Tuned with a tune: Talker normalization via general auditory processes. Frontiers in Psychology, 3, 1-9.
Liberman, A. M. (1996). Speech: A special code. Cambridge, MA: MIT Press.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74(6), 431–461.
Lindblom, B. (1963). Spectrographic study of vowel reduction. The Journal of the Acoustical Society of America, 35(11), 1773-1781.
Lindblom, B. (1990). Explaining phonetic variation: A sketch of the H&H theory. In W. J. Hardcastle & A. Marchal (Eds.), Speech production and speech modelling (pp. 403–439).
Dordrecht, The Netherlands: Kluwer Academic Publishers.
Lindblom, B., & Sundberg, J. E. (1971). Acoustical consequences of lip, tongue, jaw, and larynx movement. The Journal of the Acoustical Society of America, 50(4B), 1166-1179.
Lindblom, B., & Studdert-Kennedy, M. (1967). On the role of formant transitions in vowel recognition. The Journal of the Acoustical Society of America, 42(4), 830–843.
Linville, S. E. (2002). Source characteristics of aged voice assessed from long-term average spectra. Journal of Voice, 16(4), 472-479.
Linville, S. E., & Rens, J. (2001). Vocal tract resonance analysis of aging voice using long-term average spectra. Journal of Voice, 15(3), 323-330.
Lotto, A. J., & Holt, L. L. (2006). Putting phonetic context effects into context: A commentary on Fowler (2006). Attention, Perception, & Psychophysics, 68(2), 178-183.
Lotto, A. J., Ide-Helvie, D. L., McCleary, E. A., & Higgins, M. B. (2006). Acoustics of clear speech from children with normal hearing and cochlear implants. The Journal of the Acoustical Society of America, 119(5), 3341.
Lotto, A. J., & Kluender, K. R. (1998). General contrast effects in speech perception: Effect of preceding liquid on stop consonant identification. Attention, Perception, & Psychophysics, 60(4), 602-619.
Lotto, A. J., & Sullivan, S. C. (2007). Speech as a sound source. In W. A. Yost, R. R. Fay, & A. N. Popper (Eds.), Auditory perception of sound sources (pp. 281-305). New York: Springer Science and Business Media, LLC.
MacMahon, M. K. C. (2007). The work of Richard John Lloyd (1846–1906) and "the crude system of doctrine which passes at present under the name of Phonetics." Historiographia Linguistica, 34(2/3), 281–331.
Magnuson, J. S., & Nusbaum, H. C. (2007). Acoustic differences, listener expectations, and the perceptual accommodation of talker variability. Journal of Experimental Psychology: Human Perception and Performance, 33(2), 391-409.
Malécot, A. (1956).
Acoustic cues for nasal consonants: An experimental study involving a tape-aplicing technique. Language, 32(2), 274–284. McGowan, R. S. (1997). Vocal tract normalization for articulatory recovery and adaptation. In K. Johnson & J. W. Mullennix (Eds.), Talker Variability in Speech Processing (pp. 211–26). San Diego: Academic Press. McGowan, R. S., & Cushing, S. (1999). Vocal tract normalization for midsagittal articulatory recovery with analysis-by-synthesis. The Journal of the Acoustical Society of America, 106(2), 1090-1105. Mendoza, E., Valencia, N., Muñoz, J., & Trujillo, H. (1996). Differences in voice quality between men and women: use of the long-term average spectrum (LTAS). Journal of Voice, 10(1), 59-66. Miller, J. D. (1989). Auditory-perceptual interpretation of the vowel. The Journal of the Acoustical Society of America, 85(5), 2114-2134 . 182 Monahan, P. J., & Idsardi, W. J. (2010). Auditory sensitivity to formant ratios: Toward an account of vowel normalisation. Language and cognitive processes, 25(6), 808– 839. Moon, S. J., & Lindblom, B. (1994). Interaction between duration, context, and speaking style in English stressed vowels. The Journal of the Acoustical society of America, 96, 40-55. Nakamura, M., Iwano, K., & Furui, S. (2007, April). The effect of spectral space reduction in spontaneous speech on recognition performances. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on (Vol. 4, pp. IV-473). IEEE. Nearey, T. M. (1989). Static, dynamic, and relational properties in vowel perception. The Journal of the Acoustical Society of America, 85(5), 2088-2113. Neel, A. T. (2008). Vowel space characteristics and vowel identification accuracy. Journal of Speech, Language, and Hearing Research, 51(3), 574-585. Nordström, P. -E . & Lindblom, B . (1975) A normalization procedure for vowel formant data. Paper 212 presented at the International Congress of Phonetic Sciences, Leeds, August 1975. Peterson, Gordon E. 
(1952). The information-bearing elements of speech. The Journal of the Acoustical Society of America, 24(6), 629–637. Peterson, Gordon E. (1961). Parameters of vowel quality. Journal of Speech and Hearing Research, 4(1), 10-29. Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of the vowels. The Journal of the Acoustical Society of America, 24(2), 175–184. Picheny, M. A., Durlach, N. I., & Braida, L. D. (1986). Speaking clearly for the hard of hearing II: Acoustic characteristics of clear and conversational speech. Journal of Speech, Language and Hearing Research, 29(4), 434-446. Pittman, A. L., Stelmachowicz, P. G., Lewis, D. E., & Hoover, B. M. (2003). Spectral characteristics of speech at the ear: Implications for amplification in children. Journal of Speech, Language and Hearing Research, 46(3), 649-657. Pluymakers, M. Ernestus, M., & Baayen, R.H. (2005). Lexical frequency and acoustic reduction in spoken Dutch, The Journal of the Acoustical Society of America, 118(4), 2561-2569. 183 Potter, R. K., & Steinberg, J. C. (1950). Toward the specification of speech. The Journal of the Acoustical Society of America, 22(6), 807–820. Recasens, D. (1985). Coarticulatory patterns and degrees of coarticulatory resistance in Catalan CV sequences. Language and Speech, 28(2), 97-114. Remez, R. E., Rubin, P. E., Nygaard, L. C., & Howell, W. A. (1987). Perceptual normalization of vowels produced by sinusoidal voices. Journal of Experimental Psychology: Human Perception and Performance, 13(1), 40-61. Rhode, W. S. (1971). Observations of the vibration of the basilar membrane in squirrel monkeys using the Mössbauer tTechnique. The Journal of the Acoustical Society of America, 49(4B), 1218–1231. Robles, L., Rhode, W. S., & Geisler, C. D. (1976). Transient response of the basilar membrane measured in squirrel monkeys using the Mössbauer effect. The Journal of the Acoustical Society of America, 59(4), 926-939. Sachs, M. B., & Young, E. D. (1979). 
Encoding of steady-state vowels in the auditory nerve: Representation in terms of discharge rate. The Journal of the Acoustical Society of America, 66(2), 470–479. Sjerps, M.J., Mitterer, H., McQueen, J.M. (2011a). Listening to different speakers: On the time-course of perceptual compensation for vocal-tract characteristics. Neuropsychologia, (49), 3831-3846. Sjerps, M. J., Mitterer, H., & McQueen, J. M. (2011b). Constraints on the processes responsible for the extrinsic normalization of vowels. Attention, Perception, & Psychophysics, 73(4), 1195–1215. Story, B.H. (1995). Physiologically-based speech simulation using an enhanced wavereflection model of the vocal tract (Doctoral dissertation). University of Iowa, Iowa City, IA. Story, B. H. (2005a). A parametric model of the vocal tract area function for vowel and consonant simulation. The Journal of the Acoustical Society of America, 117(5), 3231-3254. Story, B. H. (2005b). Synergistic modes of vocal tract articulation for American English vowels. The Journal of the Acoustical Society of America, 118(6), 3834-3859. Story, B. H. (2007a). Time-dependence of vocal tract modes during production of vowels and vowel sequences. The Journal of the Acoustical Society of America, 121(6), 3770-3789. 184 Story, B. H. (2007b). A comparison of vocal tract perturbation patterns based on statistical and acoustic considerations. The Journal of the Acoustical Society of America, 122(4), EL107-EL114. Story, B. H., & Bunton, K. (2010). Relation of vocal tract shape, formant transitions, and stop consonant identification. Journal of Speech, Language and Hearing Research, 53(6), 1514-1528. Story, B. H., & Titze, I. R. (1998). Parameterization of vocal tract area functions by empirical orthogonal modes. Journal of Phonetics, 26(3), 223–260. Story, B. H., Titze, I. R., & Hoffman, E. A. (1996). Vocal tract area functions from magnetic resonance imaging. The Journal of the Acoustical Society of America, 100(1), 537-554. 
Strange, W., Weber, A., Levy, E.S., Shafiro, V., Hisagi, M., & Nishi, K. (2007). Acoustic variability within and across German, French, and American English vowels: Phonetic context effects. The Journal of the Acoustical Society of America, 122(2), 1111-1129. Sussman, H. M. (1986). A neuronal model of vowel normalization and representation. Brain and language, 28(1), 12–23. Syrdal, A. K., & Gopal, H. S. (1986). A perceptual model of vowel recognition based on the auditory representation of American English vowels. The Journal of the Acoustical Society of America, 79(4), 1086-1100. Thomas, E. R. (2001). An acoustic analysis of vowel variation in New World English. Durham, NC: Duke University Press. Titze, I. R. (1984). Parameterization of the glottal area, glottal flow, and vocal fold contact area. The Journal of the Acoustical Society of America, 75(2), 570-580. Traunmüller, H. (1981). Perceptual dimension of openness in vowels. The Journal of the Acoustical Society of America, 69(5), 1465–1475. Van Bergem, D. R., Pols, L. C. W., & Beinum, F. J. (1988). Perceptual normalization of the vowels of a man and a child in various contexts. Speech Communication, 7(1), 1–20. Verbrugge, R. R., Strange, W., Shankweiler, D. P., & Edman, T. R. (1976). What information enables a listener to map a talker’s vowel space? The Journal of the Acoustical Society of America, 60(1), 198-212. 185 Vorperian, H. K., Kent, R. D., Lindstrom, M. J., Kalina, C. M., Gentry, L. R., & Yandell, B. S. (2005). Development of vocal tract length during early childhood: A magnetic resonance imaging study. The Journal of the Acoustical Society of America, 117(1), 338-350. Vorperian, H. K., Wang, S., Chung, M. K., Schimek, E. M., Durtschi, R. B., Kent, R. D., Ziegert, A.J., & Gentry, L. R. (2009). Anatomic development of the oral and pharyngeal portions of the vocal tract: An imaging study. The Journal of the Acoustical Society of America, 125(3), 1666-1678. Warner, N. L. (2005). 
Reduction of flaps: Speech style, phonological environment, and variability. The Journal of the Acoustical Society of America, 118(3), 2035. Warner, N., Fountain, A., & Tucker, B. V. (2009). Cues to perception of reduced flaps. The Journal of the Acoustical Society of America, 125, 3317-3327. Watkins, A. J. (1991). Central, auditory mechanisms of perceptual compensation for spectral-envelope distortion. The Journal of the Acoustical Society of America, 90(6), 2942-2955. Watkins, Anthony J., & Makin, S. J. (1994). Perceptual compensation for speaker differences and for spectral-envelope distortion. The Journal of the Acoustical Society of America, 96(3), 1263–1282. Watkins, A. J., & Makin, S. J. (1996a). Some effects of filtered contexts on the perception of vowels and fricatives. The Journal of the Acoustical Society of America, 99(1), 588-594. Watkins, Anthony J., & Makin, S. J. (1996b). Effects of spectral contrast on perceptual compensation for spectral-envelope distortion. The Journal of the Acoustical Society of America, 99(6), 3749–3757. Williams, D. (1986). Role of dynamic information in the perception of coarticulated vowels (Doctoral dissertation). University of Connecticut, Storrs, CT.