GENERAL AUDITORY MODEL OF ADAPTIVE PERCEPTION OF SPEECH by Antonia David Vitela

GENERAL AUDITORY MODEL OF ADAPTIVE PERCEPTION OF SPEECH
by
Antonia David Vitela
_____________________
A Dissertation Submitted to the Faculty of the
DEPARTMENT OF SPEECH, LANGUAGE, AND HEARING SCIENCES
In Partial Fulfillment of the Requirements
For the Degree of
DOCTOR OF PHILOSOPHY
In the Graduate College
THE UNIVERSITY OF ARIZONA
2012
THE UNIVERSITY OF ARIZONA
GRADUATE COLLEGE
As members of the Dissertation Committee, we certify that we have read the dissertation
prepared by Antonia David Vitela
entitled General Auditory Model of Adaptive Perception of Speech
and recommend that it be accepted as fulfilling the dissertation requirement for the
Degree of Doctor of Philosophy
_______________________________________________________________________
Date: 11/27/12
Andrew Lotto
_______________________________________________________________________
Date: 11/27/12
Brad Story
_______________________________________________________________________
Date: 11/27/12
Kate Bunton
_______________________________________________________________________
Date: 11/27/12
Stephen Wilson
_______________________________________________________________________
Date: 11/27/12
Natasha Warner
Final approval and acceptance of this dissertation is contingent upon the candidate’s
submission of the final copies of the dissertation to the Graduate College.
I hereby certify that I have read this dissertation prepared under my direction and
recommend that it be accepted as fulfilling the dissertation requirement.
________________________________________________ Date: 11/27/12
Dissertation Director: Andrew Lotto
STATEMENT BY AUTHOR
This dissertation has been submitted in partial fulfillment of requirements for an
advanced degree at the University of Arizona and is deposited in the University Library
to be made available to borrowers under rules of the Library.
Brief quotations from this dissertation are allowable without special permission, provided
that accurate acknowledgment of source is made. Requests for permission for extended
quotation from or reproduction of this manuscript in whole or in part may be granted by
the head of the major department or the Dean of the Graduate College when in his or her
judgment the proposed use of the material is in the interests of scholarship. In all other
instances, however, permission must be obtained from the author.
SIGNED: Antonia David Vitela
ACKNOWLEDGEMENTS
I would like to express my sincere appreciation to the following individuals who
greatly contributed to the development and completion of this project:
To Andrew Lotto, my advisor, for making the world of academia seem magical
and inspiring, for teaching me the importance of a good story, and for the
constant support and advice over the past six years.
To my committee members, for all of their guidance, encouragement, insights,
criticisms, and help along the way.
To the ACNE Lab, for all of the support, the experiences, and the friendships
made.
To the Speech, Language, and Hearing Sciences Department for being my
academic home for the past 10 years.
To my family and Jay, for all of their love and support.
This work was supported by the National Institutes of Health: F31 DC011698.
DEDICATION
To
Ladefoged and Broadbent (1957)
“Please say what this word is”
For paving the way and inspiring so many researchers, including me.
TABLE OF CONTENTS
LIST OF FIGURES ...........................................................................................................10
LIST OF TABLES .............................................................................................................14
ABSTRACT .......................................................................................................................15
CHAPTER 1 INTRODUCTION .......................................................................................17
1.1 Acoustic Variability in the Speech Signal ............................................................17
1.1.1 Contextual Influences on Phonemic Representations .......................................18
1.1.2 Within-Talker Variability .................................................................................24
1.1.3 Between-Talker Variability ..............................................................................26
1.2 Traditional Theories of Talker Normalization....................................................29
1.2.1 Intrinsic Theories ..............................................................................................30
1.2.2 Extrinsic Theories .............................................................................................34
1.2.3 Other Approaches .............................................................................................40
1.3 Long-term Average Spectrum (LTAS) Model ....................................................42
CHAPTER 2 GENERAL AUDITORY MODEL..............................................................49
2.1 Limitations of Previous Stimuli ............................................................................49
2.2 TubeTalker .............................................................................................................50
2.2.1 TubeTalker Description ....................................................................................51
2.2.2 Development of the TubeTalker .......................................................................53
2.2.3 Creating Different Talkers ................................................................................54
2.3 General Auditory Model .......................................................................................56
CHAPTER 3 GENERAL METHODOLOGY AND PRELIMINARY STUDIES ...........61
3.1 Predictions ..............................................................................................................61
3.2 General Methodology ............................................................................................63
3.2.1 Participants ........................................................................................................63
3.2.2 Stimuli ...............................................................................................................63
3.2.3 Procedure ..........................................................................................................65
3.2.4 Predictions.........................................................................................................66
3.2.4.1 Deriving the LTAS ....................................................................................69
3.2.5 Analysis Methods..............................................................................................71
3.3 Prediction #1: Shifts in perception of a target phoneme can be elicited when
the LTAS of preceding contexts predict opposite effects on the target ..................72
3.3.1 Preliminary Experiment 1 .................................................................................72
3.3.2 Stimuli and Predictions .....................................................................................72
3.3.3 Participants and Procedure ................................................................................73
3.3.4 Results and Discussion .....................................................................................73
3.4 Prediction #2: Only those talker differences resulting in a change in LTAS that
corresponds to a differential contrast of an important phonetic cue will result in
a shift in perception .....................................................................................................74
3.4.1 Preliminary Experiment 2 .................................................................................74
3.4.2 Stimuli and Predictions .....................................................................................74
3.4.3 Participants and Procedure ................................................................................75
3.4.4 Results and Discussion .....................................................................................76
3.5 Prediction #3: Backwards stimuli should cause the same effect as the same
stimuli played forwards ...............................................................................................77
3.5.1 Preliminary Experiment 3 .................................................................................77
3.5.2 Stimuli and Predictions .....................................................................................78
3.5.3 Participants and Procedure ................................................................................78
3.5.4 Results and Discussion .....................................................................................78
3.6 Prediction #4: The LTAS that is calculated is an extraction of the neutral
vowel; therefore, a talker’s neutral vowel should have the same effect on
perception as a phrase produced by that talker........................................................80
3.6.1 Preliminary Experiment 4 .................................................................................80
3.6.2 Stimuli and Predictions .....................................................................................81
3.6.3 Participants and Procedure ................................................................................82
3.6.4 Results and Discussion .....................................................................................82
3.7 General Discussion .................................................................................................84
CHAPTER 4 THE TEMPORAL WINDOW ....................................................................87
4.1 Temporal Window .................................................................................................87
4.2 Experiment 1 ..........................................................................................................88
4.2.1 Methods.............................................................................................................89
4.2.1.1 Stimuli and Predictions ..............................................................................89
4.2.1.2 Participants and Procedure .........................................................................94
4.2.2 Results ...............................................................................................................94
4.2.3 Discussion .........................................................................................................95
4.3 Experiment 2 ..........................................................................................................97
4.3.1 Methods.............................................................................................................98
4.3.1.1 Stimuli and Predictions ..............................................................................98
4.3.1.2 Participants and Procedure .........................................................................99
4.3.2 Results .............................................................................................................100
4.3.3 Discussion .......................................................................................................102
4.4 Experiment 3 ........................................................................................................103
4.4.1 Methods...........................................................................................................103
4.4.1.1 Stimuli and Predictions ............................................................................103
4.4.1.2 Participants and Procedure .......................................................................107
4.4.2 Results .............................................................................................................108
4.4.3 Discussion .......................................................................................................108
4.5 General Discussion ...............................................................................................108
CHAPTER 5 EXTRACTING THE NEUTRAL VOWEL ..............................................113
5.1. Extracting the Neutral Vocal Tract...................................................................113
5.2 Experiment 4 ........................................................................................................114
5.2.1 Methods...........................................................................................................115
5.2.1.1 Signals and Predictions ............................................................................115
5.2.1.2 Procedure .................................................................................................117
5.2.2 Results .............................................................................................................117
5.2.3 Discussion .......................................................................................................122
5.3 Experiment 5 ........................................................................................................123
5.3.1 Methods...........................................................................................................124
5.3.1.1 Signals and Predictions ............................................................................124
5.3.1.2 Procedure .................................................................................................126
5.3.2 Results .............................................................................................................126
5.3.3 Discussion .......................................................................................................127
5.4 Experiment 6 ........................................................................................................132
5.4.1 Methods...........................................................................................................132
5.4.1.1 Stimuli and Predictions ............................................................................132
5.4.1.2 Participants and Procedure .......................................................................133
5.4.2 Results .............................................................................................................134
5.4.3 Discussion .......................................................................................................134
5.5 General Discussion ...............................................................................................135
CHAPTER 6 HARMONIC COMPLEX EFFECTS ........................................................139
6.1 Harmonic Complex Effects .................................................................................139
6.2 Experiment 7 ........................................................................................................140
6.2.1 Methods...........................................................................................................141
6.2.1.1 Stimuli and Predictions ............................................................................141
6.2.1.2 Participants and Procedure .......................................................................142
6.2.2 Results .............................................................................................................143
6.2.3 Discussion .......................................................................................................144
6.3 Experiment 8 ........................................................................................................146
6.3.1 Methods...........................................................................................................146
6.3.1.1 Stimuli and Predictions ............................................................................146
6.3.1.2 Participants and Procedure .......................................................................148
6.3.2 Results .............................................................................................................148
6.3.3 Discussion .......................................................................................................150
6.4 Experiment 9 ........................................................................................................150
6.4.1 Methods...........................................................................................................150
6.4.1.1 Stimuli and Predictions ............................................................................150
6.4.1.2 Participants and Procedure .......................................................................152
6.4.2 Results .............................................................................................................152
6.4.3 Discussion .......................................................................................................154
6.5 General Discussion ...............................................................................................154
CHAPTER 7 DISCUSSION AND CONCLUSIONS .....................................................156
7.1 Overview Summary .............................................................................................156
7.2 Specific Aim #1: Determine the temporal window over which the LTAS is
calculated ...................................................................................................................160
7.2.1 Interpretation and Future Research ................................................................161
7.3 Specific Aim #2: Determine whether or not the LTAS reflects the neutral
vocal tract ...................................................................................................................162
7.3.1 Interpretation and Future Research .................................................................164
7.4 Specific Aim #3: Determine what aspects of the comparisons of the LTAS and
the target predict shifts in target categorization .........................................................166
7.4.1 Interpretation and Future Research ..................................................................167
APPENDIX A. AMBIGUOUS TARGET SELECTION ................................................173
APPENDIX B. FILTERED EXPERIMENTS.................................................................174
REFERENCES ................................................................................................................176
LIST OF FIGURES
Figure 1 Spectrograms of productions of /bub/, /dud/, /bʌb/, and /dʌd/ .........................20
Figure 2 F1xF2 vowel space for three synthesized talkers ...............................................27
Figure 3 A pseudo-midsagittal view of the neutral vocal tract shapes for the five
“talkers” created from the TubeTalker ..............................................................................55
Figure 4 F1xF2 vowel space adjusted for the neutral vowel ............................................59
Figure 5 Spectrograms of the endpoints of the /dɑ/-/gɑ/ target series ............................64
Figure 6 F1xF2 vowel space of the vowel portion of the 12-step “bit” to “bet” target
series ..................................................................................................................................65
Figure 7 An example of the computer screen participants see during experimentation ...67
Figure 8 LTAS comparison between the carrier phrases “He had a rabbit and a” of the
LongVT and ShortVT and the ambiguous target from the “bit” to “bet” series................69
Figure 9 LTAS comparison between the carrier phrases “Abracadabra” of the LongVT
and ShortVT and the ambiguous target from the “da” to “ga” series ................................73
Figure 10 Results for Preliminary Experiment 1 ..............................................................74
Figure 11 LTAS comparison between the carrier phrases “He had a rabbit and a” of the
O+P-VT and O-P+VT and the ambiguous target from the “bit” to “bet” series ...............75
Figure 12 Results for Preliminary Experiment 2 (LongVT vs. ShortVT) ....................... 76
Figure 13 Results for Preliminary Experiment 2 (O-P+ vs. O+P-) ..................................76
Figure 14 Results for Preliminary Experiment 3 (Forwards) .......................................... 79
Figure 15 Results for Preliminary Experiment 3 (Backwards) .........................................79
Figure 16 LTAS comparison between the carrier phrases “Please say what this word is”
and the neutral vowel of only the ShortVT, along with the ambiguous target from the
“da” to “ga” ....................................................................................................................... 81
Figure 17 Results for Preliminary Experiment 4 (Forwards) .......................................... 83
Figure 18 Results for Preliminary Experiment 4 (Backwards)....................................... 83
Figure 19 Waveform and spectrogram of the LongVT’s production of “He had a rabbit
and a” for the stimuli used in Experiment 1 ......................................................................90
Figure 20 LTAS comparisons for the stimuli in Experiment 1 and Experiment 2 ...........92
Figure 21 Results for Experiment 1 ..................................................................................96
Figure 22 Waveform and spectrogram of the LongVT’s production of “He had a rabbit
and a” for the stimuli used in Experiment 2 .....................................................................99
Figure 23 Results for Experiment 2 ................................................................................101
Figure 24 LTAS comparisons for the stimuli in Experiment 3 ......................................105
Figure 25 Results for Experiment 3 ................................................................................107
Figure 26 LPC comparisons between Female1’s neutral vowel and different sentence
segments ..........................................................................................................................118
Figure 27 LPC comparisons between Female2’s neutral vowel and different sentence
segments ...........................................................................................................................119
Figure 28 LPC comparisons between Male1’s neutral vowel and different sentence
segments ...........................................................................................................................120
Figure 29 LPC comparisons between Male2’s neutral vowel and different sentence
segments ...........................................................................................................................121
Figure 30 The vowel trajectory synthesized from the LongVT .....................................125
Figure 31 LPC comparisons of the neutral vowel and the vowel trajectory segments
synthesized from LongVT ...............................................................................................128
Figure 32 LPC comparisons of the neutral vowel and the vowel trajectory segments
synthesized from the ShortVT .........................................................................................129
Figure 33 LPC comparisons of the neutral vowel of one TubeTalker compared to the full
vowel trajectory of the other TubeTalker .......................................................................130
Figure 34 LPC comparisons of the ShortVT, LongVT, and ambiguous target with both
the synthesized neutral vowel and the vowel trajectory ..................................................133
Figure 35 Results for Experiment 5 ................................................................................134
Figure 36 The spectra of three of the harmonic complexes and the ambiguous target from
the “bit” to “bet” series ....................................................................................................142
Figure 37 Results for Experiment 7 ................................................................................143
Figure 38 Experiment 7 individual differences for each harmonic complex..................145
Figure 39 Spectra of the ambiguous target and the harmonic complex with a
high-amplitude peak at 700 Hz in both the matched-amplitude and the high-amplitude
conditions .........................................................................................................................147
Figure 40 Results for Experiment 8 ................................................................................148
Figure 41 Experiment 8 individual differences for the matched peak condition ............149
Figure 42 Experiment 8 individual differences for the high peak condition ..................149
Figure 43 Spectra of the ambiguous target and the harmonic complex with a trough at
700 Hz in both the small trough and the large trough conditions ...................................151
Figure 44 Results for Experiment 9 ................................................................................152
Figure 45 Experiment 9 individual differences for the small trough condition .............153
Figure 46 Experiment 9 individual differences for the large trough condition ..............153
LIST OF TABLES
Table 1 A summary of the stimuli used in each of the preliminary studies......................67
Table 2 A list of the sentences recorded from the four participants and the order in which
each of the sentences was appended for each talker to make the stimuli ........................116
Table 3 The duration of each stimulus used in the LTAS comparisons with the neutral
vowel from each talker.....................................................................................................116
Table 4 Results of Experiment 4 .....................................................................................117
Table 5 Results of Experiment 5 .....................................................................................127
ABSTRACT
One of the fundamental challenges for communication by speech is the variability
in speech production/acoustics. Talkers vary in the size and shape of their vocal tract, in
dialect, and in speaking mannerisms. These differences all impact the acoustic output.
Despite this lack of invariance in the acoustic signal, listeners can correctly perceive the
speech of many different talkers. This ability to adapt one’s perception to the particular
acoustic structure of a talker has been investigated for over fifty years. The prevailing
explanation for this phenomenon is that listeners construct talker-specific representations
that can serve as referents for subsequent speech sounds. Specifically, it is thought that
listeners may either be creating mappings between acoustics and phonemes or extracting
the vocal tract anatomy and shape for each individual talker. This research focuses on an
alternative explanation. A separate line of work has demonstrated that much of the
variance between talkers’ productions can be captured in their neutral vocal tract shape
(that is, the average shape of their vocal tract across multiple vowel productions). The
current model tested is that listeners compute an average spectrum (long term average
spectrum – LTAS) of a talker’s speech and use it as a referent. If this LTAS resembles
the acoustic output of the neutral vocal tract shape – the neutral vowel – then it could
accommodate some of the talker-based variability. The LTAS model results in four main
hypotheses: 1) during carrier phrases, listeners compute an LTAS for the talker; 2) this
LTAS resembles the spectrum of the neutral vowel; 3) listeners represent subsequent
targets relative to this LTAS referent; 4) such a representation reduces talker-specific
acoustic variability. The goal of this project was to further develop and test the
predictions arising from these hypotheses. Results suggest that the LTAS model needs to
be further investigated, as the simple model proposed does not explain the effects found
across all studies.
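As an illustrative aside on the mechanics the abstract describes, the referent computation posited by the model, a long-term average spectrum of the preceding speech, can be sketched in a few lines of Python. The function names and parameter choices below are hypothetical and not drawn from the dissertation; actual LTAS analyses differ in windowing, averaging domain (linear vs. dB), and smoothing.

```python
import numpy as np

def ltas(signal, fs, frame_len=512, hop=256):
    """Long-term average spectrum: mean magnitude spectrum across frames.

    A minimal sketch of the "average spectrum as referent" idea.
    Returns the frequency axis (Hz) and the averaged magnitudes.
    """
    win = np.hanning(frame_len)
    mags = [np.abs(np.fft.rfft(signal[i:i + frame_len] * win))
            for i in range(0, len(signal) - frame_len + 1, hop)]
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    return freqs, np.mean(mags, axis=0)

def relative_spectrum(target_mag, referent_mag, floor=1e-12):
    """Express a target spectrum relative to an LTAS referent, in dB."""
    return 20.0 * np.log10((target_mag + floor) / (referent_mag + floor))
```

Under the model, hypothesis 3 (targets represented relative to the referent) then amounts to something like `relative_spectrum`: coding the target's spectrum as a dB difference from the talker's LTAS rather than in absolute terms.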
CHAPTER 1
INTRODUCTION
1.1 Acoustic Variability in the Speech Signal
In the early 1940s, two researchers (Al Liberman and Frank Cooper) set out to
create a reading machine for the blind. At the time, they assumed that speech was simply
a learned one-to-one mapping from sound to consonant/vowel identity and, therefore,
they could create their own map for a listener to learn by ascribing a distinctive sound to
each consonant/vowel. All they would need was to create a machine that could scan the
letters of a text and for each individual letter, produce its analogous new sound. Then,
they would train the blind reader to know which sounds corresponded to which letters
and voilà! The blind person would be able to listen to any written text of their choosing.
Although they expected this task to be achievable and straightforward, they quickly
discovered that the perception of speech sounds is a far more complex process than they
had believed.
They were unable to find sounds that listeners could learn and understand at a rate
comparable to that of true readers (their best participants plateaued at a fifth-grade
“reading” level), so they began to look to real speech for answers. Specifically, they set
out to discover the acoustic patterns and cues that identify speech sounds (phonemes),
hoping this information could help in the creation of more listener-friendly sounds for the
reading machine. Their search spawned an entire field, because they never found these
distinctive acoustic patterns. It turned out that there was a great amount of acoustic
variability in the speech signal. There were no specific patterns that identified each
phoneme; in fact, different acoustic patterns could lead to the same phonemic percept and
the same acoustic patterns could lead to different phonemic percepts, depending on the
surrounding phonemic context. This problem has been referred to as the “lack of
invariance” and, while some researchers continue to develop theories about invariant cues
(Blumstein & Stevens, 1979; 1980; 1981), others believe that listeners must be somehow
compensating for the variance via additional processing or special mechanisms (these
will be discussed in Section 1.2). (See Liberman (1996) for a full account of the reading
machine and the search for invariant cues in speech perception.)
The acoustic variability inherent in the realizations of phonemes comes from at
least three major sources. The first is the surrounding context of any particular phoneme
(how the phonemes that precede and follow a target sound can affect the perception of
that sound). The second is within-talker changes in speaking style (how clear vs.
conversational speech can change the speech signal of the same talker). The last is
between-talker variability (e.g., how specific differences in anatomy, physiology, and
accents between talkers affect the acoustic output). Each of these sources of variability
influences the speech signal in parallel, but has been studied separately and so will be
discussed separately below.
1.1.1 Contextual Influences on Phonemic Representations
For this section, let us pretend that there is but one speech apparatus that is
exactly the same (in size, shape, source, articulatory patterns, etc.) and that, even across
genders, every human has this one structure from which to produce speech. From
productions of this one vocal tract, we hope to analyze and find the specific speech
patterns or cues for every phoneme. Recordings of consonant-vowel-consonant (CVC)
combinations are made and the spectrograms of each production are perused for invariant
speech patterns.
The most common aspect of the signal to focus on is the formants (resonances of
the vocal tract shape). These peaks or areas of energy are the most salient in a spectrum
or spectrogram and are apparent in the neural encoding of steady-state vowels (Sachs &
Young, 1979), so the patterns and transitions that they create may be the auditory
system’s key to phonemic categorization. Figure 1 shows spectrograms for two pairs of
words (each pair shares the same vowel): /bub/, /dud/, /bʌb/, and /dʌd/. They were
recorded from a male speaker. The formants are tracked by the dots with the first two
labeled. Perceptually, the vowel in each pair of words would be identified as the same
(either /u/ or /ʌ/), but note the differences in formant frequency. For /bub/, the vowel’s
F1 is at 341 Hz and F2 is at 985 Hz, but for /dud/, the vowel’s F1 is at 265 Hz and F2 is
at 1312 Hz. The second formant is much higher in the d-vowel-d context than in the
b-vowel-b context, while the first formant is lower. Yet, again, both of these vowels are
perceived as /u/. For /bʌb/, the vowel’s F1 is at 684 Hz and the F2 is at 1213 Hz. For
/dʌd/, the vowel’s F1 is at 616 Hz, and the F2 is at 1414 Hz. Again, the second formant is
higher in the dVd context and the first formant is lower. (All formant measurements
were taken at the steady-state portion of the vowel: /bub/ = 126 msec, /dud/ = 164 msec,
/bʌb/ = 95 msec, and /dʌd/ = 97 msec.)
Figure 1. Spectrograms of
productions of /bub/, /dud/,
/bʌb/, and /dʌd/. The dots
track the location of the
formants. The first two
formants are labeled.
Lindblom (1963) attributed these formant differences to “undershoot.” Essentially,
talkers do not reach their articulatory goal when speaking at a certain rate. This
“undershoot” of the articulatory goal leads to differences in the formants, because the
articulatory target for a certain vowel may be closer or further away in certain
consonantal contexts. That is, if a consonant’s and vowel’s articulatory targets are close,
the vowel’s formant will be closer to its characteristic frequencies than if the two targets
are further away from one another. This leads to variation in the frequencies of each
formant. In any case, the surrounding context of a vowel directly affects the formant
patterns so that the formant frequencies of a vowel will not be the same across different
phonemic contexts. This means that there is no fixed formant pattern that identifies each
individual vowel sound. Listeners cannot simply use a template approach to match a
specific formant pattern to a specific vowel, since this would lead to misidentifications,
depending on the consonant context that surrounds the vowel.
Context-specific changes in acoustics have been studied over a wide range of
different phonemes. Hillenbrand, Clark, and Nearey (2001) noted significant changes in
formant patterns for eight different vowels (/i, ɪ, ɛ, æ, ɑ, ʊ, u, ʌ/) across 42 possible
consonant initial (/h, b, d, g, p, t, k/) and final (/b, d, g, p, t, k/) combinations when the
patterns were compared to the isolated vowels’ formant patterns. The most substantial
shift was found when a rounded vowel was preceded by an alveolar (e.g., “dude”). These
effects have also been shown with consonants (Delattre, Liberman, & Cooper, 1955;
Malecot, 1956; Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Blumstein
& Stevens, 1979) and in other languages (Gottfried, 1984; Recasens, 1985; Esposito,
2002; Strange, et al., 2007; Chen, 2008). With this amount of variation in the speech
signal, it does not seem that listeners could simply learn patterns for each context, since
there do not appear to be clear patterns that identify each phoneme.
The task of the listener would seem quite daunting. Somehow, a listener must
deal with the variability in the speech signal to correctly perceive speech or to simply
identify vowel sounds. Lindblom and Studdert-Kennedy (1967) examined whether
listeners’ perceptions of vowels were context-sensitive, as would seem necessary given
the context specific nature of the acoustics. First, they synthesized a 20-step /ɪ/ to /ʊ/
vowel series by manipulating the second and third formant in equal steps. Then, they
created two CVC (consonant-vowel-consonant) series (or, rather, glide-vowel-glide
series) with /jVj/ and /wVw/ providing the surrounding context. These particular glides
were chosen to ensure that the formant transitions leading to the vowel portion of the
stimulus moved in opposite directions. The second and third formant transitions for /j/
started high and then sloped downward to the vowel portion, whereas formants for /w/
started low and sloped upward. A two-alternative forced-choice task was utilized, in
which listeners were asked to identify the vowel as either the /ɪ/ in “bit” or as the /ʊ/ in
“book” with presentations in isolation or one of the CVC contexts. The authors posited
two potential outcomes: (1) Listeners’ responses would be the same for stimuli that had
identical formant positions at the midpoints (the most steady-state portion of the vowel);
or (2) Listeners’ responses would vary as a function of the context. The results showed
the latter. Relative to the phonemic boundary of the vowels in isolation, more /ɪ/
responses were made in the /wVw/ context and more /ʊ/ responses were made in the /jVj/
context. Thus, the study demonstrated that vowel perception was context sensitive. This
effect was replicated in a series of studies by Williams (1986).
Nearey (1989) followed up on this line of work to see if the effect would extend
to other contexts, since glide-vowel-glide syllables are not common in English. Using
the paradigm of Lindblom and Studdert-Kennedy (1967), Nearey created an isolated
vowel series, moving from /ɒ/ to /ʌ/ to /ɛ/ (vowels of Western Canadian English) and
then, used these vowels to make two different CVC series, /bVb/ and /dVd/. His findings
provide further evidence that the context can elicit a perceptual shift in vowel
categorization. When comparing isolation to the /dVd/ context, the boundary was shifted
towards more /ɒ/ responses for the /ɒ/ to /ʌ/ series and towards more /ʌ/ responses for the
/ʌ/ to /ɛ/ series. Both boundaries shifted in the opposite direction for the /bVb/ context
(but the shift was only significant for the /ɒ/ to /ʌ/ series). The /ʌ/ to /ɛ/ effect was also
replicated by Holt, Lotto, and Kluender (2000).
In summary, analyses of speech CVCs have revealed that formant frequencies for
vowels are affected by the surrounding phonemes. Perceptually, different formant
frequencies can elicit the same vowel identification and the same formant frequencies can
elicit different vowel identification based on the context that precedes and follows the
target. This indicates that even if every person had the same vocal tract, the lack of
invariance problem would still be present. Most surprising is that listeners are capable of
compensating for all of this variability in order to correctly perceive the intended
message of the speaker. As impressive as this is, context-sensitive variability is only one
source of acoustic variability.
1.1.2 Within-Talker Variability
In the previous section, evidence was provided to show that even if every human
had the same vocal tract, there would still be variability in the speech signal. Even that
assumption is more complex than was shown previously, because the same speaker does
not always choose to speak in the same way. That is, there are different speaking styles
that a talker can choose to use depending on the context of the situation. One type of
speaking style is known as clear speech (or “read” or “laboratory” speech). Typical
studies that have looked at formant values and the possible defining cues for different
vowels run these analyses on clear speech. Typical speakers, however, do not use this
type of speech in regular, everyday conversations but use “conversational” or “reduced”
speech. In fact, Lindblom (1990) suggests that changes in speaking style actually should
be considered a continuum from hypo-articulated to hyper-articulated speech (H&H
Theory). The speaker, essentially, chooses where they fall along this continuum based on
the listener. If the topic of conversation is one that the listener is familiar with, the
speaker will choose to speak in a more relaxed manner, falling on the hypo or reduced
side of the continuum. If the topic is not as well known to the listener, the speaker will
choose to speak more clearly, falling on the hyper or clear side of the continuum. Even
though there are two extremes to this continuum, the speaker could choose to speak at
any place in-between and could also vary their style during the conversation as the topic
changes or even between old and new information in the same sentence. It is a balance
between providing a clear message and minimizing articulatory effort.
Several studies have been carried out to characterize the acoustic changes that
accompany shifts in speaking style. In particular, there has been an in-depth description
of the results of a talker trying to produce “clear” intelligible speech (e.g., Picheny,
Durlach & Braida, 1986; Moon & Lindblom, 1994; Bradlow, Kraus & Hayes, 2003;
Lotto, Ide-Helvie, McCleary & Higgins, 2006). The consequences of clear speech
production include increased segment durations, greater f0 ranges, and shifts in vowel
formant frequencies resulting in an expanded vowel space (Chen, 1980; Picheny et al.,
1986; Moon & Lindblom, 1994; Ferguson & Kewley-Port, 2002; Bradlow et al., 2003). It
should be noted that in most of these studies the comparisons are between very “clear”
speech, such as one may produce when speaking to an interlocutor who does not share
the native language, and the relatively clear speech that is normally produced in
laboratory recordings. That is, much of the research has been conducted on the “Hyper”
end of Lindblom’s H&H continuum. Recently, there has been a growth in interest in the
hypoarticulated, reduced speech that is probably more characteristic of typical
conversational speech (e.g., Greenberg, 1999; Ernestus, 2000; Johnson, 2004; Warner,
Fountain & Tucker, 2009). This speech is typified by alterations and/or deletions of
segments or productions that are not articulated in their canonical form (Johnson, 2004;
Pluymaekers, Ernestus, & Baayen, 2005; Warner, 2005; Dilley & Pitt, 2007; Nakamura,
Iwano & Furui, 2007). Remarkably, many of the cues that researchers have focused on
to solve the problem of the “lack of invariance” may not even be present in the speech
signal during typical conversation! Listeners’ ability to correctly perceive speech is quite
amazing, but even this is a simplification of the problem. There are still differences
between talkers that need to be taken into account.
1.1.3 Between-Talker Variability
If predicting the acoustic output of a specific speaker did not seem complex
enough, between-talker variability must be considered. Obviously, every human does not
have the same vocal tract. Every human has a distinct vocal tract shape with inherent
differences, ranging from speaker-specific idiosyncrasies to more general differences due
to gender or stage of development. On top of that, talkers’ speech characteristics are a
product of their geographical location and socioeconomic status, each with its own accent
or dialectal variation. Essentially, everyone’s productions are a unique amalgamation of
their physiology and background.
This “uniqueness” was demonstrated in a classic study by Peterson and Barney
(1952). They recruited 33 men, 28 women, and 15 children from the Mid-Atlantic region
of the United States and recorded them producing ten vowels in an /hVd/ context. For
each group, they analyzed the first three formants and f0 and plotted each vowel in an
F1xF2 space. The measurements showed a great amount of variability across and within
each group, and the vowel space plots showed large overlap across different vowel
categories. There were some trends, though. For instance, the f0 and formant
frequencies of male productions were the lowest and the children’s productions were the
highest with female productions falling in between.
It may seem obvious that there would be systematic differences in formant
frequency measurements across the three groups, as a direct result of the length of the
vocal tract. For example, Figure 2 shows the F1xF2 vowel space for three different
synthesized talkers who vary only in length, while all other parameters (e.g., source, basic
vocal tract shape, articulation pattern) are constant. These talkers were created using a
vocal tract model that is based on measurements of vocal tract area shapes (the model is
described in Chapter 2; Story, 2005a). The standard vocal tract is representative of a
typical male speaker with a vocal tract length of 17.5 cm from glottis to lips. The short
vocal tract is 20% shorter (14 cm) and the long vocal tract is 20% longer (21 cm).
Figure 2 shows that, even in this constrained case, there is a great deal of variability and
overlap across and within the vowel categories for these three “talkers.”
Figure 2. F1xF2 vowel space for three synthesized talkers who vary
only in vocal tract length (Standard = 17.5 cm, Long = 21 cm, and
Short = 14 cm).
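The inverse relationship between vocal tract length and formant frequency can be illustrated with the simplest possible acoustic model: a uniform tube closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/4L. This is only a sketch for intuition (assuming a speed of sound of 35,000 cm/s), not the Story (2005a) area-function model used to generate Figure 2.

```python
# Sketch: resonances of a uniform quarter-wave tube, F_n = (2n - 1) * c / (4L).
# This is a simplification for intuition only; the talkers in Figure 2 were
# created with the Story (2005a) area-function model, not this formula.

def tube_formants(length_cm, n_formants=3, c=35000.0):
    """Return the first n resonances (Hz) of a uniform tube of the given length,
    closed at one end (glottis) and open at the other (lips)."""
    return [(2 * n - 1) * c / (4.0 * length_cm) for n in range(1, n_formants + 1)]

# The three vocal tract lengths from Figure 2:
for label, length in [("Standard", 17.5), ("Short", 14.0), ("Long", 21.0)]:
    print(label, [round(f) for f in tube_formants(length)])
```

Even this crude model captures the pattern in Figure 2: shortening the tube by 20% raises every resonance by 25%, so the same articulation yields systematically different formant frequencies for different talkers.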
This problem of vocal tract variation becomes even more complex when gender
and age are allowed to vary. Children’s vocal tracts are not just shorter than adults’ but
are configured differently. The almost 90-degree bend common to adult vocal tracts
develops over time (infants’ vocal tracts are initially structured in a way that allows them
to feed and breathe at the same time). This and other structures of the vocal tract mature
at different rates so that children are learning to manipulate an ever-changing vocal
apparatus. Then, at puberty, children’s vocal tracts change more drastically depending on
their gender. In particular, the pharyngeal region of a male vocal tract grows
significantly longer than that of a female (Kahane, 1982; Fitch & Giedd, 1999;
Vorperian, Kent, Lindstrom, Kalina, Gentry, & Yandell, 2005; Vorperian, Wang, Chung,
Schimek, Durtschi, Kent, Ziegert, & Gentry, 2009). There is also evidence that younger
and older adults differ in vocal tract size and basic shape in terms of the ratio of their oral
to pharyngeal cavities (Kahane, 1980; Xue & Hao, 2003). Thus, not only are there
inherent differences in speakers’ anatomy, but the anatomy is constantly changing over
the lifespan.
Moreover, there are regional dialects and accents that vary across and within
nationalities. In regards to North-American English, entire books have been devoted to
regional variability in the acoustics of speech production (Krapp, 1925; Kurath, 1939;
Frazer, 1993; Thomas, 2001; Labov, Ash, & Boberg, 2006). Vowel spaces based on F1
and F2 frequencies indicate that there is a great deal of variability across and even
within dialects (Clopper, Pisoni, & de Jong, 2005; Jacewicz, Fox, & Salmons, 2007;
Neel, 2008). Complicating the issue more is that these dialects and accents are not stable
but change over time. For instance, vowel shifts are a type of historical sound change
that occur when speakers of a specific geographical region shift or change the way in
which they produce vowels. One example is the California vowel shift. Compared to
vowel recordings taken in the 1950s, the back vowels /u/, /ʊ/, and /o/ are becoming more
fronted and less rounded when produced by younger adults (Hinton, Moonwomon,
Bremner, Luthin, Van Clay, Lerner, & Corcoran, 1987). In short, there is a large amount
of variation across talkers that affects the acoustic output of any single talker.
Yet, listeners are able to correctly perceive the intended message of the speaker.
Despite all of the variability caused by the surrounding phonemic context, by differences
in the anatomy of the vocal tract, by accents/dialects, and by speaking styles, listeners are
somehow able to compensate for the differences between talkers. The process by which
this happens is known as “talker normalization.” Listeners are able to “normalize” or
adjust their perception and tune to the particularities of an individual speaker so that they
can accurately identify phonemes across a wide range of talkers. The following section
lays out the predominant theories on how this process works.
1.2 Traditional Theories of Talker Normalization
Two main mechanisms have been proposed to explain the process of talker
normalization: intrinsic and extrinsic. These two differ in what they consider the
“source” of the information for accurate vowel perception. That is, the theories do not
agree on which part of the speech signal holds the necessary acoustic information that
listeners are using to understand different talkers. Intrinsic theories assume that the
source is within the segment of the acoustic signal for that phoneme – that there is a one-
to-one mapping from properties of the signal to phoneme perception, even if these
properties are difficult to find. The other group of theories is known as extrinsic theories.
These maintain that while a speaker is talking, the listener is learning information specific
to that speaker. This learned knowledge is the source and helps the listener understand
subsequent productions from that speaker. The research project described in this
dissertation concerns information available to the listener that is extrinsic to the target
vowel. (The terms intrinsic and extrinsic were first coined by Ainsworth (1975).)
1.2.1 Intrinsic Theories
Intrinsic theories of talker normalization assume that the important cues for
identifying specific vowel sounds lie within the portion of the signal that corresponds to
the phoneme of interest. These cues are broadly defined and can be any acoustic
information that lies within the vowel portion (including transitions). In most theories,
the cues are a mathematical transformation of some combination of the formants and,
sometimes, the fundamental frequency. However, the raw frequency values are
unlikely candidates for perception, because the encoding of these frequencies on the
basilar membrane is non-linear (Rhode, 1971; Robles, Rhode, & Geisler, 1976). It is
tempting to suggest, then, that these inherent non-linearities in the auditory system
may actually provide some “normalization” across talkers, since they modify the
incoming acoustic signal in a specific way. So instead of using raw values, researchers
tend to use psychophysical scales (those based on experiments that have tested listeners’
pitch perception). In these studies, listeners are asked to judge whether two tones have
the same or different pitches, and it has been found that the higher the frequency of the
tones, the more difference there needs to be between them for listeners to notice a
difference. There are numerous scales that take this into account (e.g., Koenig, mel,
Bark, Equivalent Rectangular Bandwidth). These scales are seen to be more
representative of how frequencies are encoded and so most intrinsic theories employ
them, rather than using the absolute frequency of the formant peaks.
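As an illustration of how such scales compress frequency, the sketch below implements two common conversions: the O'Shaughnessy variant of the mel scale and Traunmüller's (1990) analytic approximation of the Bark scale. Several published variants of each formula exist, so these particular expressions should be read as representative examples rather than as the versions used in any specific study cited here.

```python
import math

# Two common psychophysical frequency scales. The mel formula is the
# O'Shaughnessy variant and the Bark formula is Traunmüller's (1990)
# approximation; other published variants exist, so these are illustrative.

def hz_to_mel(f):
    """Convert frequency in Hz to mels."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def hz_to_bark(f):
    """Convert frequency in Hz to Bark (critical-band rate)."""
    return 26.81 * f / (1960.0 + f) - 0.53

# Equal 500 Hz steps in frequency become progressively smaller steps
# on both perceptual scales, reflecting the compressive encoding:
for f in (500, 1000, 1500, 2000):
    print(f, round(hz_to_mel(f), 1), round(hz_to_bark(f), 2))
```

The compressive shape is the point: two formants separated by 500 Hz are perceptually much further apart at low frequencies than at high frequencies, which is why intrinsic theories compute their cues on these scales rather than on raw Hz values.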
The first theory of this type was described by R.J. Lloyd in a series of manuscripts
published from 1890 to 1901. Notably, he was one of the first researchers to state the
importance of the first two formant frequencies for vowel identification. And, in
opposition to current views at the time, Lloyd felt that the absolute value of these
formants’ frequencies was not what distinguished vowels but that the relative value (or
ratio) between formant peaks was essential to identifying vowels. He felt that this had to
be so, since there was inherent variability in the physiology of the vocal tracts across
different speakers, which would invariably lead to differences in the resonant properties
of their vocal tracts. The ratio between the two formants, therefore, had to be the
constant cue that helped the listener to perceive the correct vowel sound (See MacMahon
(2007) for a summary of R.J. Lloyd’s work).
Since Lloyd’s time, other researchers have proposed similar theories based on
formant ratios. Some have focused more on the perceptual side by manipulating
recordings of vowel productions to observe the effect that this has on listeners’
perception. For example, Chiba and Kajiyama (1958) lowered or raised the sampling rate
of vowel recordings during playback. They found that a specific stimulus would be
identified as the same vowel, as long as each of its formants fell within a specific
frequency range. This kept the ratios relatively constant, which they claimed led to the
constant perception of the listener. Other researchers have taken an ideal-observer
approach – determining the acoustic cues that lead to optimal classification performance.
Typically, these studies identify the frequency of the first three formants and, then, plot
the F1/F2 ratio on one axis and the F2/F3 ratio on the other axis (which
looks very similar to Figure 2) in the hopes that this will clean up the space and clearly
discriminate between vowel groups. In either case, it is assumed that the ratio remains
constant and so creates a specific frequency pattern. This pattern is what is important and
unique for each vowel sound. Vowels are, therefore, identified via this pattern of
stimulation that would be represented on the basilar membrane, even if the starting point
of the pattern was displaced in frequency.
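The core claim of formant-ratio theories can be made concrete with a small sketch: if a second talker's vocal tract shifts all formants by a common scale factor, the absolute frequencies change but the ratios between adjacent formants do not. The formant values and the scale factor below are hypothetical illustrations, not measurements from any study cited here.

```python
# Sketch of the formant-ratio idea: uniform scaling of all formants
# (as from a proportionally shorter vocal tract) changes absolute
# frequencies but leaves adjacent-formant ratios unchanged.
# All values below are hypothetical illustrations.

talker_a = {"F1": 500.0, "F2": 1500.0, "F3": 2500.0}
k = 1.2  # hypothetical factor: a shorter tract raising all formants 20%
talker_b = {name: f * k for name, f in talker_a.items()}

def ratios(formants):
    """Return the (F1/F2, F2/F3) ratio pair for a set of formants."""
    return (formants["F1"] / formants["F2"], formants["F2"] / formants["F3"])

print(ratios(talker_a))
print(ratios(talker_b))  # identical ratios despite different absolute values
```

Under this view, the two talkers present the listener with the same "pattern" even though no single formant frequency is shared, which is exactly the invariance that the ideal-observer plots described above attempt to exploit.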
However, formant ratios are not successful at categorizing all vowel groups. For
instance, Potter and Steinberg (1950) found that while the F1/F2 ratio was successful in
discriminating between front vowels, it was not very successful with back vowels.
Because of this, researchers have attempted to update the formant ratio theory. Peterson
(1952; 1961) tried to add a third dimension that could further disambiguate the vowel
space. In addition to the formant ratios, he used measurements of amplitude differences
and then, also tried to use measurements of the fundamental frequency. Although these
added to the discrimination ability for some of the vowels, neither was completely
successful with the entire vowel set used in the analysis. Katz and Assmann (2001) ran a
study just on fundamental frequency, finding that its relationship with vowel
identification was not as straightforward as hoped. Halberstam and Raphael (2004) came
to the same conclusion after running a number of perceptual studies investigating the
importance of the fundamental frequency and the third formant for vowel identification.
Miller (1989) actually proposed a new, more complex, formant ratio theory with a three-stage process that converted the acoustic signal into phonetic categories. This was done
by taking into account an additional variable that he referred to as the “sensory
reference.” This reference point takes into account the average pitch of all talkers, the
average pitch of the current talker, the ratio between the fundamental frequency of the
talker and the first formant of the vowel that is being perceived, and the fluctuations in
pitch that can occur throughout a production. Miller (1989) found that by including this,
there was less overlap between the back vowels that were of concern in the previous
papers.
Researchers have also started to look for evidence of intrinsic normalization based
on knowledge of the discharge patterns of different types of neurons. Sussman (1986)
proposed a model of vowel perception based on research demonstrating that mustached
bats have neurons sensitive to different frequency combinations. If humans have similar
neurons that respond maximally to the relationship between different frequencies, these
could provide evidence in support of formant ratios. More recently, researchers have
started to record human neural responses to different frequency combinations to see if the
brain is sensitive to formant ratios. Jacobsen, Schröger, and Sussman (2004) found
evidence that it is using ERPs, and Monahan and Idsardi (2010) found similar results
using MEG. Both concluded that the formant ratio theories are still worth pursuing.
Aside from formant ratio theories, there are also theories proposing that the
differences between certain formants, or between the first formant and the
fundamental frequency, are the intrinsic variables used by the listener. Traunmüller
(1981) has suggested that the distance between f0 and the first formant plays an
important role in vowel perception, especially when judging the “openness” of a vowel.
The perceptual vowel model proposed by Syrdal & Gopal (1986) also includes the
difference between f0 and F1, as well as between F3 and F2. While the f0-F1 difference
is said to distinguish vowel height, the F3-F2 difference distinguishes between the
different places of articulation.
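A Syrdal & Gopal (1986) style analysis can be sketched by converting f0 and the formants to Bark and taking the two differences described above. The Bark conversion is Traunmüller's (1990) approximation, and the f0 and formant values are hypothetical illustrations of an /i/-like and an /ɑ/-like vowel, not data from the original study.

```python
# Sketch of a Bark-difference metric in the style of Syrdal & Gopal (1986):
# vowel height indexed by F1 - f0 in Bark, place of articulation by
# F3 - F2 in Bark. Bark conversion is Traunmüller's (1990) approximation;
# all input values below are hypothetical illustrations.

def hz_to_bark(f):
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_differences(f0, f1, f2, f3):
    height = hz_to_bark(f1) - hz_to_bark(f0)  # small for high vowels
    place = hz_to_bark(f3) - hz_to_bark(f2)   # small for front vowels
    return height, place

# A hypothetical high front vowel (/i/-like) vs. a low back vowel (/ɑ/-like):
print(bark_differences(f0=120, f1=280, f2=2250, f3=2900))
print(bark_differences(f0=120, f1=710, f2=1100, f3=2540))
```

Because both differences are computed within the same utterance, the hope is that they partially cancel talker-specific shifts in absolute frequency, which is what makes them candidates for intrinsic normalization.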
Intrinsic theories may be able to account for some of listeners’ ability to identify
vowel sounds, but they cannot account for all of it. No intrinsic theory has been able to
correctly categorize all vowel productions from multiple talkers. While researchers are
still pursuing this line of research, there is no question that extrinsic information plays a
role in vowel perception, as well. That is, the acoustic information that precedes a vowel
can influence the perception of that vowel. For instance, it has been shown that listeners’
perception of the same sound can be affected by a change in talker of a preceding phrase.
Extrinsic theories attempt to explain why this occurs.
1.2.2 Extrinsic Theories
“…if the formants differ, the sounds are not alike. If, then, listeners
hear [three vowels with differing F1 and F2 values] as the same vowel, it
is not because of any evidence in the sound. Therefore the identification is
based on outside evidence. If this outside evidence were merely the
memory of what the same phoneme sounded like a little earlier in the
conversation, the task of interpreting rapid speech would presumably be
vastly more difficult than it is. What seems to happen, rather, is this. On
first meeting a person, the listener hears a few vowel phones, and on the
basis of this small but apparently sufficient evidence he swiftly constructs
a fairly complete vowel pattern to serve as background (coordinate
system) upon which he correctly locates new phones as fast as he hears
them.” (Joos, 1948, p. 61)
This notion that listeners are able to extract or learn information about a speaker
that will help them identify subsequent productions from that speaker defines extrinsic
approaches. Joos (1948) continues to expound upon this theory in his monograph on
acoustic phonetics. He speculates that listeners must be creating a coordinate space by
identifying the extreme or corner vowels of the particular talker and identifying any other
vowel productions by placing them in this space. Additionally, he proposes that these
acoustic parameters are transmitting articulatory information, rather than purely acoustic
information. This idea that the listener is learning about the speaker’s vocal apparatus
and articulatory patterns is still debated today. On one side, researchers assume that the
acoustic information for a particular speaker is all that is necessary for talker
normalization. On the other side, some theorists suggest that the listener extracts
information specific to the articulations that the speaker uses to produce each speech
sound. As a result, there are two classes of extrinsic theories – those that claim the
listener learns about the acoustics of a speaker and those that claim the listener is learning
about the articulations or anatomy of the speaker. In the former case, the listener is
extracting an acoustic reference by which other productions will be based. In the latter,
the listener is obtaining information about the vocal tract of the specific talker.
Ladefoged and Broadbent (1957) set up one of the first studies to test the theory
Joos (1948) proposed. They created six different talkers by manipulating the first and
second formant of the synthesized phrase: “Please say what this word is.” Each carrier
phrase was played before one of four test or target words that had one of four different
vowel sounds in a bVt context. Listeners were then asked to identify the target word as
“bit”, “bet”, “bat”, or “but.” The hypothesis was that if listeners did map out a vowel
space or create a coordinate system based on their knowledge of the speaker, their
identification of the target word would shift for different talkers (i.e., different
extrapolated vowel spaces). That is exactly what the authors found. For example, there
was a significant effect found between the carrier phrase that had both a raised F1 and F2
and the carrier phrase where the F1 and the F2 were both lowered. After hearing the
carrier phrase with the raised formants, listeners labeled one of the target words as “bit”
97% of the time. That same target was labeled as “bet” 95% of the time after the carrier
phrase with the lowered formants. This provided clear evidence that extrinsic
information can affect listeners’ perceptual judgments of vowel sounds. Unlike Joos, the
authors did not explore a theory of articulatory gesture but maintained that listeners were
creating a vowel space based on the first and second formant like the one shown in Figure
2. Any ambiguous vowel sound’s formants would be plotted in this space and identified
as the vowel that sits in that relative space.
Many researchers have followed in the path set by Ladefoged and Broadbent
(1957) and maintained that listeners are creating some kind of referent based on
acoustics, although not always agreeing on what the referent actually is. For instance,
Dechovitz (1977a) ran a study that used natural speech productions from an adult male
and a nine-year-old male, instead of synthesized speech, and found an effect of talker on
the perception of target stimuli. These results were used in support of a theory of vowel
space mapping. Van Bergem, Pols, and Beinum (1988) ran a similar study with natural
speech (Dutch recordings from a female child and male adult) and also found an effect of
talker. Unlike Dechovitz (1977a), though, they concluded that a vowel space theory did
not explain their results as well as a theory that relied on templates. Therefore, the
acoustic referents that they proposed were average vowel templates that the listener had
for men, women, and children. Once the listener categorizes the speaker as one of the
three based on pitch and timbre, they use that specific template to identify the speaker’s
vowels. Verbrugge, Strange, Shankweiler, and Edman (1976) took a slightly different
approach. Based on a hypothesis that Joos (1948) proposed, they tested whether
exposure to the point vowels (/i/, /ɑ/, /u/) aided listeners’ perception of target sounds for a
specific talker. The logic behind this was that the point vowels have the most extreme
formants, and so would be the best candidates for setting up the vowel space. They found
that listeners made fewer vowel identification errors when the target was heard within a
sentence, rather than after a series of point vowels. This led the authors to believe that
listeners were extracting information about the rhythm of a particular talker’s speech and
not necessarily creating a vowel space. Ainsworth (1972; 1974) also found evidence that
the rhythm of a preceding carrier phrase could affect perception of a subsequent vowel
sound. This was suggested as more of a secondary effect after frequency, since the
durational effects were the largest for center vowels that bordered two different vowel
categories. Assmann, Nearey, and Hogan (1982) found that listeners’ vowel
identification improved when they heard stimuli blocked by speaker, rather than in a
mixed-speaker condition. This supports the more general notion that the more experience
a listener has with a specific speaker, the better the listener is at understanding that
speaker’s productions.
Not all of these acoustic-referent based studies have been successful, though.
Dechovitz (1977b) ran one that used productions from two male adults and this time, did
not find an effect of talker. Darwin, McKeown, and Kirby (1989) also ran a series of
studies that used the same set-up as Ladefoged and Broadbent (1957) and were not
always able to elicit an effect, either. Even when effects are found, researchers propose
different cues to account for the normalization (as demonstrated above). And while it
cannot be denied that prior exposure to a talker changes the perception of a subsequent
target, the explanation behind this effect is not as clear-cut as the original vowel-space
proposal from Ladefoged and Broadbent (1957).
As mentioned previously, the acoustic-reference theories are one of two types of
extrinsic theories. The other type more closely resembles Joos’ original proposal (1948)
and assumes that listeners’ reference is not based on the acoustics but on the vocal tract
dimensions or the articulation patterns specific to the speaker. Upon exposure to a talker,
listeners are able to extract information about that talker’s specific vocal tract anatomy
and physiology and articulatory patterns. Halle and Stevens (1962) proposed a two-stage
model. In the first stage, the acoustics are converted into vocal tract parameters that
would define each phoneme of a particular talker, thus removing any variability due to
vocal tract anatomy and physiology. In the second stage, these parameters are
transformed into the actual perceived phonemes. This transformation would take into
account other potential talker differences, such as dialect, accent, speaking rate, and any
other idiosyncratic speech habits. McGowan (1997) and McGowan and Cushing (1999)
attempted to implement something similar mathematically, showing how listeners could
use the acoustic signal to derive the vocal tract characteristics of a particular speaker.
However, Nordström and Lindblom (1975) suggest that normalization occurs in a simpler
fashion. First, a listener uses the third formant of open vowels to estimate the vocal tract
length, since the third formant has been shown to be related to vocal tract length
(Lindblom & Sundberg, 1971). Then, the difference between the speaker’s length and the
known average for a speaker of the same age and gender is used to extract a scaling
factor which will normalize for any differences between the particular talker and the
“average” talker. Here, vocal tract length is used to adjust perception, rather than being
the way (via articulatory patterns) that perception is actually achieved. A rather different
study was that of Remez, Rubin, Nygaard and Howell (1987). They replicated
Ladefoged and Broadbent (1957) using sine-wave speech for both the carrier phrases and
the targets, finding very similar results to the original and demonstrating that sine-wave
speech can be perceived phonetically. They claimed that the changing formant patterns
provided substantial information from which the listener could extract vocal tract
dimensions. (See Ainsworth (1975) and Nearey (1989) for experiments that pit the
intrinsic and extrinsic theories against one another.)
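The Nordström and Lindblom (1975) procedure lends itself to a simple computational sketch. The version below assumes the vocal tract can be approximated as a uniform quarter-wave resonator (Fn = (2n-1)c/4L, so F3 ≈ 5c/4L); the reference F3 of 2500 Hz (corresponding to a 17.5 cm tract) and the example values are illustrative, not the constants Nordström and Lindblom actually used:

```python
def vtl_from_f3(f3_hz, c=35000.0):
    """Estimate vocal tract length (cm) from F3, treating the tract as a
    uniform tube closed at the glottis: Fn = (2n - 1) * c / (4L), so
    F3 = 5c / (4L).  c is the speed of sound in cm/s."""
    return 5.0 * c / (4.0 * f3_hz)

def scale_factor(talker_f3, reference_f3=2500.0):
    """Factor mapping the talker's formants onto the 'average' talker's
    scale.  A shorter tract has higher formants, so its factor is < 1
    and multiplying by it lowers the formants toward the reference."""
    return vtl_from_f3(talker_f3) / vtl_from_f3(reference_f3)

# Illustrative case: a talker whose open-vowel F3 is 2900 Hz has a
# shorter-than-average tract, so all formants are scaled down.
k = scale_factor(2900.0)
normalized_f1 = 600.0 * k   # a raw F1 of 600 Hz, rescaled
```

A reference F3 of 2500 Hz yields an estimated length of exactly 17.5 cm, so `scale_factor(2500.0)` is 1.0 and an "average" talker is left unchanged.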
1.2.3 Other Approaches
While most talker normalization theories fall into one of the above categories,
there are a few that do not, because they do not agree that there is a need for
“normalization.” Magnuson and Nusbaum (2007) argue against both intrinsic and
extrinsic theories, claiming that both assume that once the differences between talkers
have been accounted for, there is a direct correspondence between the source of
information and phonetic identification. This suggests that the ability to perceive
phonemes relies on a passive system capable of discarding the variance between talkers
to get to the information that is necessary for vowel identification. Magnuson and
Nusbaum (2007) assert that talker variance is not so easily disposed of and is actually
used by listeners to modify their perception. The authors propose that talker
normalization is done via an active system that can adjust how the incoming acoustic
output is perceived based on knowledge and expectations of the speaker. For example,
one of the experiments in Magnuson and Nusbaum (2007) manipulated listeners’
expectations by informing them that the stimuli they were about to hear were from either
one speaker or two different speakers. Results showed that expectations did make a
difference in the outcome of the experiment. Reaction times were longer when
participants believed the stimuli were from two different talkers than when they believed
they were from just one talker. Fenn et al. (2011) found similar results when they tested
listeners’ ability to detect a change of voice in a telephone conversation. Unless the voice
change was between genders, most listeners did not notice a change in talker without
receiving prior information that there would be more than one talker. Glidden and
Assmann (2004) ran a different study, though, on the influences of gender expectations
by using visual stimuli matched with acoustic stimuli. They found that the perception of
a target sound changes, depending on whether the listener is watching a female or male
speaker produce the sound. These results show that listeners’ perception can be
influenced by their expectations (or top-down knowledge) and is not only influenced by
the acoustic input.
Goldinger’s (1998) theory of talker normalization is based on an episodic account
of listeners’ ability to understand a wide range of talkers. Similar to Magnuson and
Nusbaum (2007), this theory does not actually call for “normalization” but instead claims
that listeners store every token of every word that they hear in memory. When a word is
heard from a familiar speaker, many of the stored tokens for that word will be activated
and the word will be identified. When a word is heard from an unfamiliar speaker, tokens
for that word will still be activated, but to a lesser degree. Because fewer stored tokens
will be similar to a novel speaker’s production, there will be a longer reaction time for
identification. This theory, however, cannot predict or explain how a preceding carrier
phrase can shift a listener’s perception of a subsequent target vowel. The units that are
stored in memory are at the word level; therefore, the same word should have the same
identification, as no prior information is taken into account. If this were the case,
listeners’ perception of the same target should not change. For example, in the case of
Ladefoged and Broadbent (1957), the ambiguous “bit/bet” target should have always
been perceived as either “bit” or “bet.” Further, upon hearing the same ambiguous target
over and over, listeners’ memory of that word should have become stronger and stronger,
thus solidifying its identification as the same percept. The exemplar model would have
to change the size of the unit to be able to explain extrinsic effects. This could very
quickly run into problems, though. If the size of the unit was extended to include several
words, then would the listener need to have heard those words in that specific order to
understand them? Due to the infinite number of ways words could be grouped together,
this idea does not seem like a plausible explanation. Thus, it appears that the exemplar
theory cannot fully explain how listeners understand many different talkers. As
mentioned previously, though, even the predominant theories on extrinsic normalization
fall short. The next section discusses a relatively novel approach based on general
auditory interactions in perception.
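For concreteness, the episodic account can be caricatured as a minimal MINERVA-style exemplar model. Everything here — the two-dimensional "acoustic" traces, the exponential similarity function, the cubing of similarities — is a simplified stand-in for Goldinger's (1998) actual implementation, not a reproduction of it:

```python
import math

def similarity(probe, trace):
    """Similarity in (0, 1] that falls off with Euclidean distance."""
    return math.exp(-math.dist(probe, trace))

def echo_intensity(probe, memory):
    """Summed (cubed) similarity to every stored trace.  Cubing, as in
    MINERVA-style models, lets close matches dominate the echo."""
    return sum(similarity(probe, t) ** 3 for t in memory)

# Hypothetical 2-D "acoustic" traces of one word from a familiar talker
memory = [(1.0, 1.0), (1.05, 0.95), (0.95, 1.1), (1.1, 1.0)]

familiar = echo_intensity((1.0, 1.0), memory)   # probe from the familiar talker
novel = echo_intensity((2.0, 0.2), memory)      # same word, unfamiliar talker

# Weaker activation for the novel talker -> slower identification,
# which is the pattern the episodic account is built to capture.
assert familiar > novel > 0.0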
1.3 Long-term Average Spectrum (LTAS) Model
As mentioned previously, a substantial change in talker has not always led to a
change in perception of a target phoneme. Dechovitz (1977b) was not able to find an
effect of talker when he used stimuli from two male adults, even though the speakers’
vocal tracts differed greatly in size and shape. Darwin, McKeown, and Kirby (1989) also
failed to demonstrate talker-based shifts when they presented listeners with filtered
carrier phrases and asked them to identify a target at the end. Why is it, then, that
listeners’ judgment of vowels does not always change as a function of speaker?
A possibility extends from the work of Holt (2005; 2006). These experiments
investigated the effects of non-speech precursors on the identification of a target speech
sound. Specifically, listeners were presented with a sequence of tones followed by one
target speech sound from a series that changed perceptually from /dɑ/ to /gɑ/. Listeners
were asked to identify whether the target sound was a /dɑ/ or a /gɑ/. One of the main
cues that listeners use to distinguish between these two sounds is the transition of the
third formant. If the transition starts high, the speech sound is typically identified as /dɑ/.
If the transition starts low, the perceived sound is /gɑ/. Holt (2005; 2006) found that
listeners’ boundary along the series between /dɑ/ and /gɑ/ shifted as a function of the
mean frequency of the preceding tone sequence. If the tone sequence had a high
average above the F3 of the target, listeners heard /gɑ/ more often. If the tone sequence
had a low average below the F3, listeners heard /dɑ/ more often. In fact, this result was
predicted from previous research on phonetic context effects (Lotto & Kluender, 1998;
Holt, Lotto, & Kluender, 2000; Lotto & Holt, 2006), which showed that a preceding
high-frequency sound (whether a formant, tone or noise band) can cause the perceived
formants of a following speech sound to be effectively lowered. In this case, a /dɑ/
would become a /gɑ/, since a /gɑ/ has a lower frequency F3 – just as Holt (2005; 2006)
found. Even further, these effects can be predicted by calculating the spectral average
across the tone sequence precursor and comparing it to the spectral average of the target
(particularly in the region of an identifying cue like the F3 for /dɑ/ to /gɑ/). This suggests
that listeners may be calculating some kind of average during the precursor and using this
average when making perceptual judgments.
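The contrastive logic of Holt's (2005; 2006) findings can be captured in a few lines. The category boundary of 2200 Hz is a hypothetical placeholder for the F3 region of the /dɑ/–/gɑ/ series, not a value taken from the original studies:

```python
def mean_frequency(tone_sequence_hz):
    """Spectral mean of a preceding pure-tone sequence."""
    return sum(tone_sequence_hz) / len(tone_sequence_hz)

def predicted_bias(tone_sequence_hz, boundary_f3_hz=2200.0):
    """Contrastive prediction for an ambiguous /da/-/ga/ target: a precursor
    averaging above the target's F3 region makes the F3 transition sound
    lower (more /ga/ responses), and vice versa."""
    m = mean_frequency(tone_sequence_hz)
    if m > boundary_f3_hz:
        return "ga"        # high context -> F3 perceived as lower
    elif m < boundary_f3_hz:
        return "da"        # low context -> F3 perceived as higher
    return "no shift"

assert predicted_bias([2800, 3000, 2600]) == "ga"
assert predicted_bias([1600, 1400, 1800]) == "da"
```

The only quantity the listener would need to track is the running mean of the precursor, which is what makes the account "general auditory" rather than speech-specific.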
This type of effect is known as a spectral contrast effect – the perception of a
previous stimulus affects the perception of a current stimulus in a contrastive manner –
and different types of contrast effects have been found in other perceptual domains. For
example, when participants are asked to pick up a weight and judge how heavy it is, their
perception changes, depending on whether they picked up a heavier or lighter weight
beforehand. If they picked up a heavier weight, the target weight seems lighter and if
they picked up a lighter weight, the target weight seems heavier (Di Lollo, 1964).
Similar types of effects on color perception have also been found in the visual domain
(Chevreul, 1839). It seems that contrast effects are a general property of our perceptual
systems. In the auditory domain, this spectral contrast effect appears to cause changes in
speech perception.
Holt’s (2005; 2006) demonstration of this with a preceding tone sequence
followed by a target is provocatively similar to the paradigm of Ladefoged and Broadbent
(1957). In both, listeners are presented with a preceding stimulus (tones or speech)
followed by a speech sound that they are asked to identify. And, in both, listeners’
perception of the target changed as a function of the content of the preceding stimulus.
Since these tasks are so similar, it is tempting to suggest that the same mechanism can
account for both of them. That is, listeners could be calculating the average energy at
each frequency during the precursor (known as the long-term average spectrum or LTAS)
and using this as a referent for target perception. If the LTAS of a particular carrier
phrase has a peak in spectral energy close to the frequency of a target speech sound’s
formant (and this formant is important for the identification of that particular target), then
the carrier would be predicted to have an effect on the target’s perception. Alternatively,
if the LTAS does not show a peak close to a formant, then it could be predicted that the
carrier phrase would not have an effect on the target’s perception. Additionally, the
LTAS comparisons would not only show when a target’s perception would be affected,
but also how the target’s perception would be affected. As in the studies of Holt (2005;
2006), a tone sequence with a high average would be expected to lower the perception
of the third formant in the “da” to “ga” series and so would be predicted to result in more
“ga”
responses. By comparing the LTAS of a precursor and a target, one is capable of making
accurate predictions of how a particular target sound will be perceived.
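As a sketch of what such a comparison might look like computationally, the following computes an LTAS by averaging frame-by-frame magnitude spectra and then compares band energy near a formant of interest. The tone-complex "carriers," the target F1 region, and the 150 Hz band half-width are illustrative choices, not stimuli or parameters from any of the studies cited:

```python
import numpy as np

def ltas(signal, fs, frame_len=512, hop=256):
    """Long-term average spectrum: magnitude spectra of overlapping
    Hann-windowed frames, averaged over the whole utterance."""
    win = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * win
              for i in range(0, len(signal) - frame_len + 1, hop)]
    spectrum = np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    return freqs, spectrum

def energy_near(freqs, spectrum, center_hz, half_width_hz=150.0):
    """Mean LTAS energy in a band around a formant of interest."""
    band = (freqs >= center_hz - half_width_hz) & (freqs <= center_hz + half_width_hz)
    return spectrum[band].mean()

# Stand-ins for two "carrier phrases": tone complexes with energy
# concentrated above or below a hypothetical target F1 region.
fs = 16000
t = np.arange(fs) / fs                            # one second of signal
high_carrier = np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 900 * t)
low_carrier = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 350 * t)

f_hi, s_hi = ltas(high_carrier, fs)
f_lo, s_lo = ltas(low_carrier, fs)

# The "raised" carrier has more energy above the target's F1; a contrast
# account predicts it will make the target's F1 sound lower.
assert energy_near(f_hi, s_hi, 800.0) > energy_near(f_lo, s_lo, 800.0)
```

Comparing the two carriers' band energies around the relevant formant yields both predictions at once: whether a shift should occur (a difference in band energy) and in which direction (contrastively, away from the carrier with more energy there).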
This novel approach has the potential to explain the results of past extrinsic
normalization studies that use this paradigm, in which a listener hears a preceding context
(whether tones or speech) and is asked to identify a target speech sound at the end.
LTAS comparisons between the carrier phrases and the target should reveal that there
was a difference in the frequency of the peaks in spectral energy between the carrier
phrases in a frequency region close to one of the formants of the target. For example, the
main manipulation used by Ladefoged and Broadbent (1957) was to lower or raise the
first and second formant across the phrase “Please say what this word is.” For one of the
phrases, they raised both the first and second formant. For another of the carrier phrases,
they lowered both of the formants. The target sound they used was an ambiguous speech
sound identified as “bit” half of the time and as “bet” half of the time. When listeners
heard this target after the carrier phrases, they heard “bit” more often following the
carrier phrase that had raised formants. These results could be predicted from an LTAS
account. An important cue for listeners when making a distinction between a “bit” and a
“bet” is the first formant. If the first formant is low, listeners will identify the sound as
“bit.” If the first formant is high, listeners will identify the sound as “bet.” The carrier
phrase with the raised formants would have had more spectral energy in the higher
frequencies. This could have effectively lowered the perception of the first formant in
the target and so would explain why the carrier phrase with the higher formants resulted
in more “bit” responses. This model is, therefore, capable of predicting and explaining
the effects that a carrier phrase can have on target perception.
These comparisons may also be able to explain when shifts in target perception
did not occur. It is possible that a preceding carrier phrase would not have energy close
to a target’s formants and so would not be predicted to shift listeners’ perception. This
suggests that it is the spectral content of the carrier phrase, not the speaker of that phrase,
that determines whether an effect is found. This means that a change in talker will not
necessarily result in a change in perception. Only those changes in talker that result in
changes of the spectral average (around a formant) will cause a shift in target perception.
This provides an alternative explanation for experiments that found no difference
on target perception when the carrier phrase was produced by different talkers (Dechovitz
1977b; Darwin, McKeown, & Kirby, 1989). For example, the carrier phrases used in
Dechovitz (1977b) were produced by two male talkers who had “substantially different
vocal tract dimensions” (p. S39). While the differences between the speakers are not
explained further, it could be that they were still similar enough (both being male
speakers) that any theory would predict that the target would not have been perceived
differently. For instance, a theory based on vowel-space mapping would claim that their
vowel spaces were very similar and so, would predict similar perception of the target.
From the perspective of the LTAS account, though, the lack of an effect would have been
because there was not a substantial difference in the LTAS of the two carrier phrases that
would cause a change in perception. Without the original stimuli, the ability of the LTAS
model to explain past results cannot be tested. However, it does present an appealing
alternative account for why talker normalization effects have not always been found and
can be used to make testable predictions in future studies (as this research aims to
demonstrate). These predictions can, in fact, tease apart the differences between the
predominant theories and this new LTAS theory.
Broadbent and Ladefoged (1960) actually proposed something similar to the
LTAS account when they revisited their theory of talker normalization. They changed
their explanation to one that was based on a theory of adaptation level (Helson, 1964).
Essentially, if listeners hear a series of sounds or a phrase with a high first formant, then
the “adaptation level” (or average F1 value for that speaker) would be set high. When the
listener then hears a target word, its F1 value would be judged relative to this adaptation
level; against a high level, the target’s F1 would seem low, and the word would be identified
as the word with the lower F1 (so, in the case of “bit” or “bet,” it would be perceived as
“bit”). This theory is closer to an LTAS account than their previous theory, which
proposed the listener was making an acoustic-phonemic map for each talker. More recent
findings of Watkins (1991) and Watkins and Makin (1994; 1996) have also leaned
towards the need for a new model that could account for effects of extrinsic talker
normalization. They have used spectrally distorted and filtered carrier phrases and shown
that these affect vowel perception differently, depending on their spectral content.
However, neither of these groups has developed a full theoretical model that can provide
an explanation for effects of changes in talker. The research discussed in this dissertation
proposes and aims to develop a novel, quantitative model that is capable of predicting
effects of talker normalization.
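As an illustration of how little machinery the adaptation-level account of Broadbent and Ladefoged (1960) requires, the sketch below judges an ambiguous token against the mean F1 of the preceding phrase. All F1 values are invented for illustration:

```python
def adaptation_level(f1_values_hz):
    """Adaptation level as the mean F1 of the speech heard so far -- a
    simple stand-in for Helson's (1964) weighted pooling."""
    return sum(f1_values_hz) / len(f1_values_hz)

def identify_bit_bet(target_f1_hz, level_hz):
    """An ambiguous token is judged relative to the adaptation level:
    below the level it sounds like the lower-F1 word ("bit"), above it
    like the higher-F1 word ("bet")."""
    return "bit" if target_f1_hz < level_hz else "bet"

ambiguous_f1 = 500.0
raised_phrase = [650.0, 700.0, 620.0, 680.0]    # carrier with raised F1
lowered_phrase = [380.0, 350.0, 420.0, 400.0]   # carrier with lowered F1

# Matching Ladefoged and Broadbent (1957): the raised-F1 carrier
# yields "bit", the lowered-F1 carrier yields "bet".
assert identify_bit_bet(ambiguous_f1, adaptation_level(raised_phrase)) == "bit"
assert identify_bit_bet(ambiguous_f1, adaptation_level(lowered_phrase)) == "bet"
```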
CHAPTER 2
GENERAL AUDITORY MODEL
2.1 Limitations of Previous Stimuli
While previous research on talker normalization has unquestionably helped
further the development and knowledge of the field, the stimuli used in many of the
experiments limits the conclusions that can be drawn. One type of stimulus set used
follows the work of L&B (Ladefoged and Broadbent (1957) will be referred to as L&B
from this point forward): A stimulus or stimulus set is recorded from a speaker and then,
synthesized and manipulated, potentially in ways that are not possible with a human
vocal tract. For example, six versions of the carrier phrase “Please say what this word
is” from L&B’s study were created to sound as if each had been produced by a different
talker. This was done by shifting one or both of the formants up or down in
frequency from their original frequency value. The problem with stimuli of this type is
that the manipulations imposed may not be realistic for what a speaker is physically
capable of with his/her vocal tract, and so are unconstrained by what is anatomically and
physiologically possible. To investigate talker normalization, it would be more beneficial
to have stimuli that are accurate representations of different talkers’ productions. This
would help substantiate the interpretation of results from perceptual experiments, since
the stimuli would more closely resemble the real-world input that a listener receives.
This leads to the other stimulus type – actual recordings from multiple talkers that
are not synthesized or manipulated but left in their raw form. While these are obviously
realistic and “real-world,” it is hard to control for the types of differences between talkers
and know what the differences actually are and how they affect the acoustic output. Even
if dialect/accent and gender are controlled, there are still variations in the anatomical
structure of the vocal tract and idiosyncratic production styles that remain unknown.
What is gained in terms of ecological validity is at the expense of control and knowledge
of the important variables. It would require a great deal of imaging and articulatory
measures to know the true anatomical differences that exist between a set of talkers.
Further, even with this knowledge, it would be difficult to match other aspects of speech
production like the speaking rate or timing of the articulations and the source or pitch
variation over time. What is needed is a stimulus set that is realistically related to talker
differences but, at the same time, provides controllable and knowable talker differences.
2.2 TubeTalker
One way to create stimuli that are both controllable and realistic, and thus avoid
the problems of the previously mentioned stimulus sets, would be through the use of a
computational model that could control both the source and filter of a modeled vocal
tract. Such a model would need to have two specific characteristics. The first would be
that the model be based on analyses of real vocal tracts. That is, the model would have to
function and be constrained in the same manner as real vocal tracts are during production.
The second characteristic would be that the model allows the user to have complete
control over all aspects of the vocal tract (source, size, shape, length, articulatory pattern,
etc.). With such a model, different talkers could be created in a realistic and controllable
way. Any changes in the formants or acoustic output would be based on actual
modifications that vocal tracts can undergo and not post-hoc adjustments to an already
recorded stimulus. Further, the cause for any acoustic variation between talkers would be
known, since the user would know which parameters were set differently. For example,
if two talkers were created that had all parameters set equal except for their length, then
length would be the reason for any differences in the acoustic output. Such a
computational model exists and is known as the TubeTalker (Story, 2005a; Story &
Bunton, 2010).
2.2.1 TubeTalker Description
In its simplest form, speech production can be thought of as two separate
components: the source and the filter (Fant, 1960). The voicing source provides the
energy or activation of the system via an interaction of the respiratory system and vocal
folds. The filter is the shape of the vocal tract, or, rather, the airspace created by the
articulators above the glottis, essentially a shaped “tube” through which the source
passes. Each possible shape of the filter or tube results in a set of resonances or formant
peaks. These different sets of resonant peaks help define different speech sounds.
During speech, the source and filter work together in a continuously changing manner.
The source is either creating voiced or voiceless sounds (the vocal folds are either in
vibration or not). The vocal tract is moving fluidly through various airspace shapes (the
articulators are in constant motion). The coordinated control of both produces a string of
speech sounds that a listener is able to decode as a meaningful message.
The source-filter theory is one of the underlying principles behind the TubeTalker
synthesizer. The source of the TubeTalker is based on a kinematic model of the vocal
folds, which allows manipulation of a number of parameters: amount of adduction at the
arytenoid processes, shape of the glottal opening at rest, phase and amplitude of the
movement of the upper and lower portions of the vocal folds, thickness of the vocal folds,
length of the glottal opening, and fundamental frequency (Titze, 1984). The effects of
the filter are computed based on a wave-reflection model that takes into account how the
acoustic waves interact with physiological aspects of the vocal tract, such as the skin and
side cavities (e.g., the piriform sinuses) (Story, 1995).
The filter can be thought of as a tube system made up of 44 mini-tubes or
“tubelets.” Each tubelet’s length and cross-sectional area can be specified individually so
that the model can take into account articulatory movements like lip protrusion or a
lowering of the larynx. The length of each tubelet can also be set to create more general
differences. For example, keeping the cross-sectional area constant, while changing the
length of each tubelet to be longer or shorter, can create different “vocal tracts.” This
essentially changes the length of the tube and would cause a change in perceived talker,
even if it is constricted and expanded in the exact same way as another tube. On the
other hand, the cross-sectional area could be manipulated while the length is maintained
to create vocal tract shapes that differ in cavity size of the oral or pharyngeal regions.
This vocal tract synthesizer gets around the problems that were mentioned before
by allowing for the creation of multiple talkers with known vocal tract features and
articulatory patterns. An underlying tube shape can be made by setting the starting cross-sectional area and length of the tubelets and then can be constricted and expanded to
create different speech sounds. All of these parameters can be controlled and the effect
that they have on perception can be examined.
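As a simplified illustration of the tubelet representation, the sketch below models the neutral tract as equal-length tubelets of constant area and computes resonances for the special case of a uniform tube. (TubeTalker's actual wave-reflection computation handles arbitrary shapes; the area value here is arbitrary and the quarter-wave formula applies only to the uniform case.)

```python
def uniform_tract(n_tubelets=44, total_length_cm=17.5, area_cm2=3.0):
    """Neutral vocal tract represented as n equal-length "tubelets" of
    constant cross-sectional area (the area value is illustrative)."""
    seg_len = total_length_cm / n_tubelets
    return [(seg_len, area_cm2)] * n_tubelets

def quarter_wave_formants(tract, n_formants=3, c_cm_s=35000.0):
    """Resonances of a *uniform* tube closed at the glottis and open at
    the lips: Fn = (2n - 1) * c / (4L).  A non-uniform area function
    would require a full wave-reflection computation."""
    length = sum(seg_len for seg_len, _ in tract)
    return [(2 * n - 1) * c_cm_s / (4.0 * length) for n in range(1, n_formants + 1)]

standard = uniform_tract()                       # the 17.5 cm default tract
shorter = uniform_tract(total_length_cm=14.0)    # a smaller hypothetical talker

# Shortening the tube raises every resonance, changing the perceived
# "talker" even though the shape (uniform) is identical.
assert quarter_wave_formants(shorter)[0] > quarter_wave_formants(standard)[0]
```

This captures the length manipulation described above; the complementary manipulation, expanding or constricting individual tubelet areas at constant length, is what the full model's area functions provide.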
2.2.2 Development of the TubeTalker
As its name implies and as explained above, the TubeTalker can be thought of as a
tube system. The tube can be constricted or expanded to match the cross-sectional area
function of a vocal tract (or, rather, of the airspace within the vocal tract) producing a
given sound. The cross-sectional area functions that the model uses were based on MRI
images of sustained phoneme productions from a male vocal tract (Story, Titze, &
Hoffman, 1996). Analyses from these images allowed for the development of a
hierarchical system that incorporates three tiers to produce continuous, time-varying
speech (Story, 2005a). Basically, the model starts with a “neutral” vocal tract shape (Tier
1). This shape is then modified to create particular vowel sounds (Tier 2) and then,
consonant constrictions are added to the underlying vowel-to-vowel transitions at specific
points in time (Tier 3) to create synthesized speech.
The vowel shaping in the model is controlled by two parameters that are the first
two principal components derived from the area functions for ten of the vowels from
Story, Titze, & Hoffman (1996). These analyses showed that the mean shape of the area
functions for the vocal tract approximated a uniform tube (the neutral vocal tract shape).
This neutral shape or starting point is Tier 1. Story and Titze (1998) suggest that the
acoustic output of this shape is the neutral vowel sound and that this neutral shape is
“perturbed” during speech production to create other speech sounds. To create vowels,
this shape is perturbed by two “modes.” The modes can be thought of as two different
tube shapes whose values can be set independently of one another to create a combined
shape with unique resonant properties. The first mode accounted for 75% of the variance
in area functions and when its value is increased, it raises the frequency of F1 but lowers
the frequency of F2. The second mode accounted for 18% of the variance and when its
value is increased, it raises the frequency of both F1 and F2. From an articulatory
standpoint, the first mode could be thought of as controlling the upward/downward and
backward/forward movement of the tongue and some of the upward/downward
movement of the jaw. The second mode has the most influence on the middle of the
vocal tract (tongue arching and upward/downward motion), as well as controlling lip
rounding. Using the modes to change the shape of the neutral vocal tract and create
vowels is Tier 2.
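The two-mode perturbation of Tier 2 can be sketched as follows. The cosine "modes" are invented stand-ins for the empirically derived principal components, chosen only to show how two independent weights reshape a single neutral area function:

```python
import numpy as np

def vowel_area_function(neutral, mode1, mode2, q1, q2):
    """Tier-2 shaping: perturb the neutral area function with a weighted
    sum of two modes.  The clip keeps the tube from closing completely."""
    return np.clip(neutral + q1 * mode1 + q2 * mode2, 0.1, None)

n = 44                                 # number of tubelets, glottis to lips
x = np.linspace(0.0, np.pi, n)
neutral = np.full(n, 3.0)              # uniform neutral tube (cm^2)
mode1 = np.cos(x)                      # hypothetical mode: back-vs-front tilt
mode2 = np.cos(2.0 * x)                # hypothetical mode: midsection bulge

# Opposite mode weights produce opposite front/back cavity configurations
vowel_a = vowel_area_function(neutral, mode1, mode2, q1=-2.0, q2=1.0)
vowel_b = vowel_area_function(neutral, mode1, mode2, q1=2.0, q2=-1.0)

assert vowel_a[0] < vowel_b[0]         # constricted vs. expanded at glottis end
assert vowel_a[-1] > vowel_b[-1]       # expanded vs. constricted at lip end
```

Because the same pair of weights can be applied to any underlying neutral shape, different "talkers" can produce the "same" vowel gesture, which is the property the model exploits below.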
The third tier adds the consonant constrictions to the underlying vowel component
of tier two (also based on analyses from Story, Titze, & Hoffman (1996)). These can be
defined in a number of ways (e.g., location and area) that can be constrained by how and
where the articulators constrict the air space of the vocal tract for real speech productions.
Once these are configured, the voice source described above is added to the system to
create synthesized speech.
2.2.3 Creating Different Talkers
The parameters for the neutral vocal tract shape can be modified by adjusting the
length and cross-sectional area of the tubelets mentioned previously. The standard vocal
tract created by default settings of the TubeTalker model is that of a male with a vocal
tract length of 17.5 cm.

Figure 3. A pseudo-midsagittal view of the neutral vocal tract shapes for the five
“talkers” created from the TubeTalker. (Note that the center VT in both the bottom
and top row is the StandardVT. This was done to make it easier to compare the
StandardVT to each variation.) For each vocal tract, the cross-sectional area of each
tubelet is plotted so that it is perpendicular to and equidistant above and below the
centerline of the vocal tract. The x-axis is the anterior-posterior (A-P) dimension and
the y-axis is the inferior-superior (I-S) dimension. (Image created by Dr. Brad Story.)

To create different speakers, one can adjust the length to be longer
or shorter, while keeping the cross-sectional area constant. On the other hand, the length
can remain constant and the cross-sectional area can be changed. The oral cavity could
be expanded while the pharyngeal cavity is constricted or vice versa. Figure 3 shows the
neutral vocal tract shape for TubeTalkers with these modifications. Once these shapes
are made, parameters can be set for the vowel modes and for consonant constrictions that
will perturb the neutral vocal tract shape. In this way, the model allows for complete
control over synthesized talkers. The TubeTalkers differ in neutral vocal tract shape but
all other parameters can be constant – rate of production, vibratory source, pitch contour,
vowel modes, and consonant constrictions. The exact same modifications are happening,
just to different underlying shapes. Thus, the synthesizer solves the issues of the
previous stimuli because (1) the acoustic output is changed in realistic ways, since
TubeTalker is based on real vocal tracts and constrained by their physiological limits; and
(2) the articulatory patterns and anatomical structure of the different “talkers” are
controlled, instead of containing unknown variability. The majority of the stimuli used in
this dissertation were created using the TubeTalker model.
2.3 General Auditory Model
A commonality across all of the theories of extrinsic talker normalization is that
the listener is assumed to compensate for differences between talkers by extracting
information specific to the speaker. Whether this was an acoustic-based reference, such
as a speaker’s vowel space, or the actual articulatory patterns for a speaker, the theories
assumed that the listener was using some type of speaker-specific knowledge. If this
were true, a substantial change in talker should have led to a change in vowel perception;
however, this was not always the case. Dechovitz (1977b) and Darwin, McKeown, and
Kirby (1989) were not always able to find an effect as a function of talker. This is a
challenge for these predominant theories of extrinsic talker normalization. Chapter 1
proposed an alternative explanation that stemmed from the work of Holt (2005; 2006).
This explanation is central to the model proposed and explored in this dissertation.
Holt (2005; 2006) found evidence of a general auditory mechanism that calculates
the average of the spectral content of a non-speech stimulus and uses this average when
perceiving a following speech sound. It is being proposed here and has been proposed
previously (Lotto & Sullivan, 2007) that listeners calculate the average of the spectral
energy of a preceding speech phrase, as well. This average (referred to as the long-term
average spectrum or LTAS) is then used as a referent when identifying speech sounds.
By comparing the LTAS of preceding stimuli (speech or non-speech) with the target
speech sound, one is able to make predictions about which phrases will shift the
perception of the target sound. If a carrier phrase’s LTAS has a peak in frequency that is
close to the cue important for the target’s identification (such as a formant), then the
LTAS could affect the perception of that phoneme. For instance, if the LTAS peak is
lower in frequency than a subsequent target vowel’s F1, it could effectively raise the
perception of that target’s F1 and change the listener’s perception of it – a spectral
contrast effect. This means that LTAS comparisons not only predict when there will
be an effect on perception but also how perception is affected (i.e., in what direction the
target’s perception will be shifted). This was the first idea from which the LTAS model
was based.
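This contrastive logic amounts to a simple decision rule, which can be sketched in code. The following Python fragment is purely illustrative – the function name, its arguments, and the 500-Hz proximity window are assumptions made here for exposition, not parameters of the model:

```python
# Illustrative sketch of the spectral-contrast rule described above.
# The proximity window (500 Hz) is an assumed value, not from the model.

def predict_contrast_shift(ltas_peak_hz, target_formant_hz, window_hz=500.0):
    """Predict the direction of the perceived shift in a target formant,
    given the frequency of the context LTAS peak near the cue region."""
    if abs(ltas_peak_hz - target_formant_hz) > window_hz:
        return "no shift predicted"          # peak too remote from the cue
    if ltas_peak_hz < target_formant_hz:
        return "perceived formant raised"    # contrast pushes the cue upward
    return "perceived formant lowered"       # contrast pushes the cue downward

# Example: a context peak at 400 Hz below a target F1 of 600 Hz
print(predict_contrast_shift(400.0, 600.0))  # perceived formant raised
```

The rule mirrors the example in the text: an LTAS peak lower than the target vowel's F1 effectively raises the perceived F1, and a peak with no energy near the cue predicts no shift.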
The second idea that formed the basis of this model was a finding of Story
(2005b; 2007b). He expanded his original research on vocal tract area functions to six
other speakers, showing that the modes generalize across them. However, while the
modes are very similar, the neutral vocal tract shapes are different – “the modes appeared
to provide roughly a common system for perturbing a unique underlying neutral vocal
tract shape” (Story, 2005b, p. 3855). Thus, the average vocal tract shape appears to be
what separates or distinguishes between different speakers. Lotto and Sullivan (2007)
suggested that if a listener can gain access to the acoustic output of the neutral vocal tract
shape (the neutral vowel), they may be able to use this to normalize for differences across
talkers.
The model that I have developed and test in this document assumes that listeners
can, in fact, use the neutral vowel to normalize for talkers. Specifically, it suggests that
listeners estimate the neutral vowel via the general auditory mechanism that calculates
the LTAS. If the average of the different vocal tract shapes is the neutral vocal tract
shape, it may be that the average of the acoustic output of different vocal tract shapes
(speech) resembles the acoustic output of the neutral shape (the neutral vowel). Thus, the
spectral average that listeners calculate over the duration of the carrier phrase could be
similar to the spectral properties of the neutral vowel. Listeners could then be using this
average as the referent by which they base their phonemic judgments. Figure 4 shows
the vowel space of Figure 2, but this time the neutral vowel for each vocal tract is set as
the origin. The variability between talkers is reduced substantially when the neutral
vowel is set as the referent.
If listeners do calculate an LTAS of the carrier phrase and this average resembles
the neutral vowel of that talker, then listeners’ perception of a target speech sound would
only change when the LTAS of the talker’s neutral vowel showed a contrastive effect
with the formant of the target. It is necessary to point out, however, that these two ideas
are not mutually dependent. It may be the case that listeners calculate the spectral average of
a preceding carrier phrase and use it as a referent, but it does not reflect the acoustic
output of the neutral vowel for a specific talker. The opposite could also be true.
Figure 4. Plot of the same vowels from Figure 2 with the addition of the
neutral vowel. The graph has been adjusted so that the neutral
vowel is the referent vowel or origin for each speaker.
Listeners could be extracting the acoustics of the neutral vowel and using this to
normalize for different talkers, but they may not be using the LTAS as defined here (see
Section 3.2.4.1 in Ch. 3) to do so. The general auditory model proposed here is not
strictly tied to either of these ideas but rather to the development of a model that can
explain the adaptive perception of speech. If part of the model is found unsupported,
then it will evolve and incorporate these findings into a newer version of itself.
This being said, the model at this point does include both of these ideas (the
LTAS mechanism and the neutral vowel) and based on these, an alternative model of
talker normalization is proposed that has four main hypotheses: 1) during carrier phrases,
listeners compute an LTAS for the talker; 2) this LTAS resembles the spectrum of the
neutral vowel; 3) listeners represent subsequent targets relative to this LTAS referent; 4)
such a representation reduces talker-specific acoustic variability.
CHAPTER 3
GENERAL METHODOLOGY AND PRELIMINARY STUDIES
3.1 Predictions
Specific predictions can be made based on the four hypotheses stated at the end of
Chapter 2. These predictions are as follows:

Prediction #1: Shifts in perception of a target phoneme can be elicited
when the LTAS of preceding contexts predicts opposite contrast effects on
the target. Similar to the effects found by L&B, a change in talker can
lead to a difference in target perception. According to the LTAS model
and its hypotheses, these effects are general in nature and so are not
limited to target vowel identification but should extend to consonant
identification, as well. Changes in “talkers” created from the TubeTalker
should also be capable of producing these shifts if the change in LTAS for
the talker results in differential contrast with the spectrum of the target.

Prediction #2: ONLY those talker differences resulting in a change in
LTAS that corresponds to a differential contrast of an important phonetic
cue will result in a shift in perception. If these effects are not due to
talker-specific normalization processes but strictly based on contrastive
effects between the LTAS of the carrier phrase and the target, then it is
possible that a substantial change in talker (whether the talker is real or
created with the TubeTalker synthesizer) may not affect subsequent target
perception. The LTAS for two (or more) talkers could fail to lead to
differential contrast with the target spectrum, resulting in no perceptual
effects.

Prediction #3: Backward stimuli should cause the same effect as the
stimuli played forward. If the calculation of the LTAS does serve as the
referent and targets are identified relative to it, then stimuli that have the
same exact LTAS should cause the same exact effect on target perception.
Therefore, playing stimuli in reverse should have the same effect as when
the stimuli are played forward, since the LTAS of both is the same.

Prediction #4: The LTAS that is calculated is an extraction of the neutral
vowel; therefore, a talker’s neutral vowel should have the same effect on
perception as a phrase produced by that talker. The neutral vowel vocal
tract shape is what varies between talkers, while the articulatory patterns
are quite similar (Story, 2005b; 2007b). The LTAS model proposes that
“normalization” occurs because the LTAS of a speaker will come to
resemble the output of the neutral vocal tract. Thus, if listeners use the
LTAS as a referent, individual differences in talker vowel spaces will have
a reduced effect on listener identification. This implies that the LTAS of
carrier phrases should have an effect similar to that of the neutral vowel.
Each of these predictions was specifically tested in a preliminary study to further develop
the model.
3.2 General Methodology
3.2.1 Participants
All participants were recruited from the University of Arizona and received either
course credit or monetary compensation for their time. All reported no history of hearing
loss and English as their native language.
3.2.2 Stimuli
Carrier Phrases. TubeTalker was used to create five different “talkers” by
manipulating the underlying or neutral vocal tract (Figure 3). The first vocal tract was
the standard vocal tract (StandardVT). Its length was 17.5 cm (tubelet = .39682 cm).
The other four vocal tracts were all modifications of the StandardVT. The short vocal
tract (ShortVT) was 20% shorter than the StandardVT (14 cm; tubelet = .317456 cm) and the
long vocal tract (LongVT) was 20% longer than the StandardVT (21 cm; tubelet =
.476184 cm). The synthesized vowels from Figure 1 were created with the StandardVT,
ShortVT, and LongVT. The next two “talkers” differed in the neutral shape but were
constant in length. These were created by manipulating the cross-sectional area of the
oral and pharyngeal cavities relative to the StandardVT. One had a constricted
pharyngeal cavity and an expanded oral cavity (O+P- VT) and the other had a constricted
oral cavity and an expanded pharyngeal cavity (O-P+ VT). Some combination of these
talkers was used in the experiments that follow. It is important to note that creating
carrier phrases with the TubeTalker is a complex process. Because of this, phrase
choices were limited to what the author (Story, 2005a) had created at the time of the
development of these experiments.
Figure 5. Spectrograms of the endpoints of the /dɑ/-/gɑ/ target series.
The onset frequency of the third formant is labeled on the y-axis.
Targets. The targets were always a series in which one of the segments gradually
changed phonetic identity. Three different series were used in the experiments. The first
was the 10-step /dɑ/-/gɑ/ series used by Holt (2005; 2006). Each target stimulus was 330
msec long. They were based on natural tokens of /dɑ/ and /gɑ/ produced by a male,
native-English speaker and were created using LPC analysis/resynthesis. The onset of
the second and third formant was manipulated in approximately equal steps from an
unambiguous /dɑ/ to an unambiguous /gɑ/. Spectrograms of the endpoints can be seen in
Figure 5.
The second and third target series were created from one continuous series
synthesized from the StandardVT. This target series consisted of 20 steps that changed
from “bit” to “bet” to “bat.” These targets did not change in equal formant intervals (as is
typical when creating a target series) but changed in equal area or articulation steps
(using the two modes described previously). Each member of the series was 250 msec in
length and had a pitch contour that varied from 93 Hz at the /b/ to 100 Hz during the
vowel and then dropped to 87 Hz before the /t/. The first twelve steps were used as a
“bit” to “bet” series (bVt1-bVt12). The first two formants for the vowels in this series
are plotted in a vowel space in Figure 6. This series was used in the majority of the main
experiments; however, in Chapter 5 (Experiment 6) an eight step “bet” to “bat” series
was used. These consisted of the 10th-17th step of the series (bVt10-bVt17).
Figure 6. F1xF2 vowel space of the vowel portion (at .082 sec) of the 12-step
“bit” to “bet” target series.
3.2.3 Procedure
All of the preliminary studies (and the perceptual experiments described in
Chapters 5, 6, and 7) follow a similar paradigm to that of Ladefoged and Broadbent
(1957). The stimuli consist of a carrier phrase followed by a target syllable or word after
a 50-msec inter-stimulus interval. Listeners are asked to identify the target as one
of two options. For a summary of the carrier phrases and target combinations used in
each of the preliminary experiments, see Table 1.
Participants were run in groups of one to four on separate computers in a large
soundproof booth. On each trial, they were presented with a single randomly selected
Carrier Phrase + Target stimulus over circumaural headphones (Sennheiser HD 280) at
approximately 75 dB SPL. They categorized the target by using a mouse to click one of
two labeled boxes on a monitor (Figure 7 shows an example of the screen that
participants saw during experimentation). The next trial did not begin until the
participant responded. The stimuli were split into blocks with a certain number of
stimulus repetitions in each block. This gave participants the opportunity to take short
breaks between blocks if needed. The entire session, including consent and debriefing,
took no more than one hour. Stimulus presentation and data collection were controlled
by the ALVIN software program (Hillenbrand & Gayvert, 2005).
3.2.4 Predictions
Predictions for each experiment were derived from comparisons of the average
spectral energy for each carrier phrase with the spectrum of the ambiguous member of the
target series (bit/bet = bVt5, bet/bat = bVt13, da/ga = dg5). Appendix A explains how
the ambiguous targets were determined. Each of the predictions was based on a specific
frequency region, depending on the target series being used. For the “bit” to “bet” and
“bet” to “bat” series, the LTAS comparisons show the frequency region around the first
formant (0-1200 Hz). This is because the first formant is the important cue for
distinguishing between these three vowels (/ɪ, ɛ, æ/). An /ɪ/ has a lower first formant than
Pred.  Exp.  Carrier Phrase                     Duration        FW/BW  Targ. Series
1      1     (1) ShortVT – “Abra…”              (1) 1.4 sec     FW     /dɑ/ - /gɑ/
             (2) LongVT – “Abra…”               (2) 1.4 sec
2      2a    (1) ShortVT – “He had a…”          (1) 1.3 sec     FW     /bɪt/ - /bɛt/
             (2) LongVT – “He had a…”           (2) 1.3 sec
       2b    (1) O+P-VT – “He had a…”           (1) 1.3 sec     FW     /bɪt/ - /bɛt/
             (2) O-P+VT – “He had a…”           (2) 1.3 sec
3      3a    (1) ShortVT – “He had a…”          (1) 1.3 sec     FW     /bɪt/ - /bɛt/
             (2) LongVT – “He had a…”           (2) 1.3 sec
       3b    (1) ShortVT – “He had a…”          (1) 1.3 sec     BW     /bɪt/ - /bɛt/
             (2) LongVT – “He had a…”           (2) 1.3 sec
4      4a    (1) ShortVT – Neutral Vowel        (1) 400 msec    FW     /dɑ/ - /gɑ/
             (2) ShortVT – “Please say…”        (2) 1.7 sec
       4b    (1) ShortVT – Neutral Vowel        (1) 400 msec    BW     /dɑ/ - /gɑ/
             (2) ShortVT – “Please say…”        (2) 1.7 sec
Table 1. A summary of the stimuli used in each of the preliminary
studies (Pred. = Prediction; Exp. = Experiment).
Figure 7. An example of the computer screen participants see during
experimentation. The software used to run the experiments is ALVIN
(Hillenbrand & Gayvert, 2005).
an /ɛ/, which has a lower first formant than an /æ/. The frequency region used to make
predictions concerning “da/ga” perception is from 1800 to 3000 Hz. This region contains
the third formant, the important cue for the /dɑ/-/gɑ/ distinction. A good “da” has a higher
F3 onset frequency than a good “ga,” which has a low F3 onset frequency. (See
Appendix B for a continued discussion of why and how these frequency regions were
focused on in making LTAS predictions.)
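Restricting a comparison to one of these cue regions and locating its dominant peak can be sketched as follows. The helper below is an illustration only – the function name and the assumption of a linear frequency axis spanning 0 to fs/2 are introduced here for exposition:

```python
import numpy as np

# Illustrative sketch: find the dominant peak of a spectrum inside a cue
# region such as 0-1200 Hz (F1, for "bit"-"bet") or 1800-3000 Hz (F3,
# for /da/-/ga/). The spectrum is assumed to span 0..fs/2 linearly.

def band_peak_hz(spectrum_db, fs, lo_hz, hi_hz):
    """Return the frequency of the largest value of spectrum_db
    within the band [lo_hz, hi_hz]."""
    freqs = np.linspace(0.0, fs / 2.0, len(spectrum_db))
    mask = (freqs >= lo_hz) & (freqs <= hi_hz)
    band = np.where(mask, spectrum_db, -np.inf)  # ignore out-of-band bins
    return freqs[int(np.argmax(band))]

# Toy spectrum peaking near 500 Hz, examined in the F1 region
fs = 44100
freqs = np.linspace(0.0, fs / 2.0, 4097)
toy_ltas = -np.abs(freqs - 500.0)
print(band_peak_hz(toy_ltas, fs, 0.0, 1200.0))
```

Such a band-limited peak, compared against the ambiguous target's formant, is the quantity on which the predictions below rest.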
Generally, if the LTAS of the carrier phrase had a trough (low amplitude) in the
region of the ambiguous target’s formant and a frequency peak lower or higher than that
formant, that carrier phrase was expected to have an effect on target perception. Figure 8
shows an example of an LTAS comparison with both of these characteristics. The
carriers are the phrase “He had a rabbit and a” synthesized from both the Long and Short
VT. The ambiguous target is the fifth step in the “bit” to “bet” series produced by the
StandardVT. Note that the ShortVT phrase has a higher frequency average peak in the
F1 region than the target and the LongVT phrase has a peak that largely overlaps with the
target. Based on this figure, the LTAS model would predict that the ShortVT phrase
would cause an effective lowering of the F1 in the target, making it more “bit” like.
Thus, a test of the prediction would be whether there are more “bit” responses for the
ShortVT versus the LongVT contexts. It is important to note that while the spectral
comparisons are referred to as “LTAS” comparisons, the images actually use a
combination of LTAS, linear-predictive coding (LPC), and discrete Fourier transform
(DFT) spectral representations. The choices are based on the length and spectral content
of the carrier phrase. As a general rule, LPC is used for stimuli less than one second in
length. The figure caption for each LTAS comparison will specify which type of
representation is being used for each stimulus.
Figure 8. LTAS comparison between the carrier phrases “He had a rabbit
and a” of the LongVT and ShortVT and the ambiguous target from the
“bit” to “bet” series (bVt5) in the region surrounding F1. (LTAS =
LongVT, ShortVT; LPC = bVt5).
3.2.4.1 Deriving the LTAS
Signal Analysis. The LTAS provides an analysis of the spectral properties of an
acoustic signal over a longer temporal duration than is typical with a spectrum. The
LTAS separates the signal into a certain number of bins or windows of time. For each
time frame, a spectrum is calculated using a DFT. The average amplitude for each
frequency is then calculated across these time frames. The LTAS can then be plotted in
terms of frequency by amplitude.
In the past, the LTAS has been used to determine the distribution of frequency
energy present for hearing or other acoustic detection systems. In many of these cases, it
has been used to describe the running speech of males, females, and children in an attempt to
advance hearing aid development (i.e., to determine which frequency regions contain
more energy and may be more important for speech perception) (Dunn & White, 1940;
Byrne, 1977; Cornelisse, Gagné, & Seewald, 1991; Byrne et al., 1994; Pittman,
Stelmachowicz, Lewis, & Hoover, 2003). More recently, it has been used to describe
voice quality differences: male vs. female (Mendoza, Valencia, Muñoz, & Trujillo,
1996), young vs. old (Linville, & Rens, 2001; Linville, 2002), and singing vs. speaking
(Cleveland, Sundberg, & Stone, 2001).
These studies use the LTAS in a descriptive fashion to provide a measure of the
frequency composition of input. The theoretical account proposed in this dissertation
suggests that listeners actually calculate something akin to the LTAS during speech
comprehension. This average is then used to make subsequent perceptual judgments.
This proposal is novel and utilizes the acoustic measure of LTAS as a perceptual
representation in much the same way that the FFT spectrum went from being an acoustic
analysis to the presumed representation of speech perception.
Within this proposal, the LTAS comparisons were computed and graphed in the
MATLAB programming environment (The MathWorks, Natick, MA, USA) with in-house
scripts written by Dr. Brad Story. As mentioned, a combination of analyses was
used, depending on which best represented each specific stimulus. The LTAS was
computed by taking the average of non-overlapping spectra extending over the entire
duration of the signal. Each individual spectrum in the LTAS corresponded to a 100-msec
Hanning-windowed section of the waveform, which was zero-padded to 8192 points prior to
computing an FFT. The LPC spectra were calculated for a 25-msec Hamming-windowed
section of the waveform using an LPC analysis with 46 coefficients. The sampling
frequency of all analyzed waveforms was 44100 Hz.
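The computation just described can be sketched in a few lines. The fragment below is an illustrative Python/NumPy version of the same steps (non-overlapping 100-msec Hanning windows, zero-padding to 8192 points, averaging of magnitude spectra), not the original MATLAB scripts:

```python
import numpy as np

# Sketch of the LTAS computation described above: non-overlapping
# 100-msec Hanning-windowed frames, each zero-padded to 8192 points
# before an FFT, with magnitudes averaged across frames.

def compute_ltas(x, fs=44100, win_ms=100, nfft=8192):
    win_len = int(fs * win_ms / 1000)        # 4410 samples = 100 msec
    window = np.hanning(win_len)
    spectra = []
    for i in range(len(x) // win_len):       # non-overlapping frames
        frame = x[i * win_len:(i + 1) * win_len] * window
        spectra.append(np.abs(np.fft.rfft(frame, n=nfft)))  # zero-padded
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    return freqs, np.mean(spectra, axis=0)

# Example: a 1-sec, 1000-Hz tone should yield an LTAS peak near 1000 Hz
t = np.arange(44100) / 44100.0
freqs, ltas = compute_ltas(np.sin(2 * np.pi * 1000.0 * t))
print(freqs[np.argmax(ltas)])
```

The result can then be plotted as frequency by average amplitude, as in the LTAS comparison figures.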
Neural Representation of LTAS. While the LTAS has been used to analyze
acoustic signals and listeners’ ability to calculate it is an assumption of the model
proposed, there are no findings demonstrating a neural representation of the LTAS as of
yet. Sjerps, Mitterer, and McQueen (2011a) ran an EEG study, investigating neural
responses to spectral contrast effects with speech stimuli. They found a response in the
N1 time window, leading them to suggest that these effects occur at a very early stage of
processing (within the first 100 – 200 msec) that is not influenced by conscious
processing of the listener. However, the authors do not give any suggestion of where or
how this is actually being represented and processed in the brain.
3.2.5 Analysis Methods
Depending on the experiment, separate paired-sample t-tests or ANOVAs were
conducted to show whether shifts in target perception were significant. The dependent
variable for the perceptual studies was the percentage of “da,” “bit,” or “bet” responses
collapsed across all series members (for the preliminary studies) and across the five
middle targets (for the main experiments). As the middle of the series contains the more
ambiguous targets, it was determined that collapsing across these is more sensitive to
shifts in target perception than collapsing across the entire series. The five middle targets
for the bit/bet series were bVt4-bVt8 and for the bet/bat series were bVt11-bVt15. The
carrier phrase served as the independent variable.
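As an illustration of this analysis, the sketch below computes a paired-samples t statistic directly from the paired differences. The listener scores are invented for demonstration only and do not reproduce any reported result:

```python
import math
import numpy as np

# Illustrative sketch of the paired-samples analysis described above.
# Each listener contributes one percent-response score per condition;
# the t statistic is computed from the paired differences.

def paired_t(a, b):
    """Paired-samples t statistic for matched condition arrays a and b."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return d.mean() / (d.std(ddof=1) / math.sqrt(len(d)))

# Hypothetical percent-"bit" scores for 16 listeners in two conditions
rng = np.random.default_rng(0)
short_vt = rng.normal(60.0, 8.0, 16)            # after the ShortVT phrase
long_vt = short_vt - rng.normal(10.0, 3.0, 16)  # after the LongVT phrase

print(f"t(15) = {paired_t(short_vt, long_vt):.2f}")
```

With 16 listeners the test has 15 degrees of freedom, matching the t(15) values reported for the preliminary experiments.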
3.3 Prediction #1: Shifts in perception of a target phoneme can be elicited when the
LTAS of preceding contexts predicts opposite contrast effects on the target.
3.3.1 Preliminary Experiment 1
The earlier work on spectral contrast by Lotto and Kluender (1998) and Holt
(2005; 2006) used target series that varied from /dɑ/ to /gɑ/ (as opposed to the vowel
targets in L&B). In order to link these two empirical paradigms within one model, one of
the target series used was the /dɑ/-/gɑ/ targets (of the work of Lotto & Holt), following
carrier phrases produced by different talkers (as in L&B) created with TubeTalker. Thus,
the first aim of this experiment was to test whether the effects L&B found with vowel
normalization could be extended to consonant identification, as would be predicted by the
LTAS model. The second aim of the experiment was to validate that the different
“talkers” created with TubeTalker could indeed provide the same types of shifts in target
categorization as L&B demonstrated.
3.3.2 Stimuli and Predictions
LTAS comparisons showed that the phrase “Abracadabra” synthesized from the
LongVT and the ShortVT would have a contrastive effect on the /dɑ/-/gɑ/ target series.
Examining the region around the third formant showed that the LongVT had a frequency
peak lower in frequency than the ambiguous target’s F3 and a trough in the region of the
ambiguous target’s F3. This could push the effective F3 of the target higher in frequency
causing more “da” responses. The ShortVT does not have a peak, though, and so would
not be expected to have an effect. This comparison is shown in Figure 9. The figure only
shows the F3 region, as the third formant is the important cue for the /dɑ/-/gɑ/ distinction.
3.3.3 Participants and Procedure
Seventeen subjects participated in this study. There were 200 trials total (1 carrier
phrase x 2 talkers x 10 targets x 10 repetitions).
3.3.4 Results and Discussion
A significant shift in “da” responses in the direction predicted by the LTAS
comparison (t(16) = 5.07, p < .005) was found between the ShortVT and LongVT (See
Figure 10). The results provide a demonstration of a shift in consonant target perception
as a function of talker in the carrier phrase. This links the paradigms used by Ladefoged
& Broadbent on talker normalization with the work by Lotto and Holt on spectral context
effects. The direction of the shift obtained is consistent with the LTAS comparison.
Figure 9. LTAS comparison between the carrier phrases “Abracadabra” of the
LongVT and ShortVT and the ambiguous target from the “da” to “ga” series
(dg5) in the region surrounding F3 (LTAS = LongVT, ShortVT; LPC = dg5).
Further, these results provide evidence that these controlled stimuli (with equal settings
for all parameters but length) are capable of eliciting shifts as a function of “talker.”
Figure 10. Results for Preliminary
Experiment 1: Mean percent “da”
responses for the LongVT and
ShortVT synthesized versions of
“Abracadabra.” Error bars indicate
standard error of the mean.
3.4 Prediction #2: Only those talker differences resulting in a change in LTAS that
corresponds to a differential contrast of an important phonetic cue will result in a
shift in perception.
3.4.1 Preliminary Experiment 2
An important assumption of Prediction #2 is that not all shifts in talker identity
(even if perceptible) should lead to shifts in target responses. Whether a change in talker
will or will not result in a change in target perception should be predictable from a
comparison of the LTAS carrier and target. That is, different talkers may not have
different effects on target perception if their LTAS differences do not result in differential
contrast effects when compared to the target’s spectrum.
3.4.2 Stimuli and Predictions
The phrase “He had a rabbit and a” (1.3 sec) was synthesized from the four VT
variations (Long, Short, O+P-, and O-P+). The targets consisted of the 12-step “bit” to
“bet” series (250 msec each) synthesized from the Standard VT. Based on the LTAS in
the region of the first formant, it was predicted that the ShortVT would lead to more “bit”
responses than the LongVT (Figure 8). The LTAS comparison did not predict any shift
between the O+P-VT and the O-P+VT. The LTAS of each talker has a peak above the
F1 of the ambiguous target, so their effects on perception would be predicted to be
similar (Figure 11).
3.4.3 Participants and Procedure
Sixteen subjects participated in this study. The carrier phrases + targets were
separated into blocks. The Long and Short VT carrier phrases + targets were in Block 1.
The O+P- and O-P+ VT carrier phrases + targets were in Block 2. There were five
repetitions of each stimulus per block and participants went through each block twice
(order counterbalanced). This totaled 480 trials (1 phrase x 4 talkers x 12 targets x 10
repetitions).
Figure 11. LTAS comparison between the carrier phrases “He had a rabbit
and a” of the O+P-VT and O-P+VT and the ambiguous target from the “bit”
to “bet” series (bVt5) in the region surrounding F1 (LTAS = O+P-, O-P+; LPC
= bVt5).
3.4.4 Results and Discussion
Results supported the LTAS predictions. Listeners perceived “bit” significantly
more often following the ShortVT’s carrier phrase than when the targets followed the
LongVT’s carrier phrase (t(15) = -5.42, p<.001) and there was no significant difference
between the O+P- and O-P+ VTs (t(15) = .69, p>.05). Figure 12 and 13 show these
results.
Figure 12. Results for Preliminary
Experiment 2: Mean percent “bit”
responses for the LongVT and
ShortVT synthesized versions of “He
had a rabbit and a.” Error bars
indicate standard error of the mean.
Figure 13. Results for Preliminary
Experiment 2: Mean percent “bit”
responses for the O-P+VT and O+P-VT
synthesized versions of “He had a
rabbit and a.” Error bars indicate
standard error of the mean.
These results support the LTAS account by demonstrating the power of this
model to predict which talker differences will result in shifts in target perception and
which talker changes will not. It is important to note that this is not the first instance
where a change in talker has not led to a change in target perception. Dechovitz (1977b)
ran a similar study using productions from two different male talkers and did not find a
significant difference in target perception between the two carrier phrases. Darwin,
McKeown, and Kirby (1989) also ran a series of studies investigating talker
normalization and were not always successful in finding an effect of talker. If LTAS
comparisons were made with these stimuli, it would be expected that the carrier phrases
used would not differ substantially in their average energy and no spectral contrast effect
would have been predicted. Thus far, the LTAS account is supported.
3.5 Prediction #3: Backward stimuli should cause the same effect as the same stimuli
played forward.
3.5.1 Preliminary Experiment 3
According to the simplest LTAS account, a stimulus that differs from the carrier
phrase but maintains the same LTAS should result in a similar shift in target responses.
Since forward and backward versions of a phrase have equivalent LTAS, they
should result in equivalent effects. This prediction was tested using reversed versions of
some of the carrier phrases. These phrases would not be intelligible but would have
identical LTAS.
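The premise that a phrase and its reversal share an LTAS can be checked numerically. In the sketch below (an illustrative helper, not the analysis scripts used here), time-reversing a signal reorders the non-overlapping frames and time-reverses each one; neither operation changes an FFT magnitude spectrum, so the averaged spectrum is unchanged:

```python
import numpy as np

# Numerical check of the premise behind Prediction #3: with symmetric
# (Hanning) windows and non-overlapping frames, the LTAS of a signal
# and of its time-reversal are identical.

def ltas_mag(x, win_len=4410, nfft=8192):
    window = np.hanning(win_len)                 # symmetric window
    frames = [x[i:i + win_len] * window
              for i in range(0, len(x) - win_len + 1, win_len)]
    return np.mean([np.abs(np.fft.rfft(f, n=nfft)) for f in frames], axis=0)

rng = np.random.default_rng(1)
signal = rng.standard_normal(4410 * 10)          # 1 sec of noise at 44.1 kHz
forward = ltas_mag(signal)
backward = ltas_mag(signal[::-1])                # the "played backward" signal
print(np.allclose(forward, backward))            # True
```

This holds because the magnitude of a DFT is invariant under time-reversal of a real signal, and averaging is insensitive to frame order.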
3.5.2 Stimuli and Predictions
The carrier phrases were the same as in Preliminary Experiment 2 (“He had a
rabbit and a”). However, only the Long and Short VT were used, since their LTAS
comparison showed an effect in the previous experiment. The same 12-step target series
was used (“bit”-“bet”).
3.5.3 Participants and Procedure
Sixteen subjects participated in this study. The stimuli were split into two blocks
(Block 1 = forward stimuli; Block 2 = backward stimuli). There were five repetitions of
each stimulus per block and participants were run in each block twice (order
counterbalanced) for a total of 480 trials (1 phrase x 2 talkers x 2 directions x 12 targets x
10 repetitions).
3.5.4 Results and Discussion
A 2x2 within-subjects ANOVA was used to analyze the data. The factors were
vocal tract (LongVT, ShortVT) and direction (forward, backward). There was not a main
effect of vocal tract (F(1,15) = 3.00, p>.05). There was a main effect of direction
(F(1,15) = 17.09, p<.05) and a significant interaction (F(1,15) = 8.73, p<.05). Planned
comparisons showed that listeners heard “bit” significantly more often following the
ShortVT than following the LongVT (F(1,15) = 5.17, p<.05) when the phrases were
played forward, but there was no significant difference when the phrases were played
backward (F(1,15) = .58, p>.05). Results are shown in Figure 14 and 15.
The significant interaction means that the effect size of phrase was not equal for
the forward and backward presentations. The forward effect was larger than the
backward effect. These results conflict with the LTAS theory, as it predicted that a
phrase played backward should have the same effect as that phrase played forward. One
possible explanation is that there is a temporal window over which the LTAS is
computed. That is, only the acoustic information within that window of time is used in
the calculation of the LTAS. The information prior to that has no effect on target
perception. This window may be shorter than the duration of the carrier phrases used.
Figure 14. Results for Preliminary
Experiment 3: Mean percent “bit”
responses for the LongVT and
ShortVT synthesized forward versions
of “He had a rabbit and a.” Error
bars indicate standard error of the
mean.
Figure 15. Results for Preliminary
Experiment 3: Mean percent “bit”
responses for the LongVT and
ShortVT synthesized backward
versions of “He had a rabbit and a.”
Error bars indicate standard error of
the mean. Significance is marked for
the planned comparison.
When the stimuli were played backward, the local spectral information (that is, the
average spectral energy closest to the target) may have been different from when it was
played forward, causing different effects in the backward and forward conditions. It could
also be the case that there is a temporal window but that the window is weighted,
meaning the spectral information closer to the target has a greater representation in the
LTAS than the spectral information more distant in time. Either one of these would
explain this discrepancy in the results without discounting the LTAS theory. A third
possibility is that carrier phrases without phonetic content do not elicit talker
normalization effects. Whereas the first two possibilities suggest a change in how one
calculates the LTAS for the current model, the last possibility would require a substantial
reformulation, as the model is considered to be a general perceptual process and not
specific to speech.
3.6 Prediction #4: The LTAS that is calculated is an extraction of the neutral vowel;
therefore, a talker’s neutral vowel should have the same effect on perception as a
phrase produced by that talker.
3.6.1 Preliminary Experiment 4
If targets are perceived relative to the LTAS and this LTAS is reflective of the
neutral vowel, then, a carrier phrase and the neutral vowel from the same talker should
have the same effect on the target series. However, if listeners do not extract the neutral
vowel from the speech precursor, then, the neutral vowel and the carrier phrase may
cause different effects or shifts in perception (if an LTAS comparison would predict one).
3.6.2 Stimuli and Predictions
Both of the carrier phrases in this block were created from the ShortVT. One was
simply the neutral vowel (400 msec) and the other was the phrase “Please say what this
word is.” This carrier phrase was thought to be long enough and to include enough of the
vowel space that a listener could extract the neutral vowel from it. The 10-step /dɑ/-/gɑ/
series served as the targets. The LTAS comparison of the two ShortVT carrier phrases
and the ambiguous target is shown in Figure 16. An additional component to this
experiment was a second condition in which the carrier phrases were played backward.
If the LTAS is what is important and it is an extraction of the neutral vowel, it should not
matter if the stimuli are played forward or backward. Listeners should be able to extract
the neutral spectral representation from either and use this representation as a referent
that they use when making phonemic judgments.
Figure 16. LTAS comparison between the carrier phrases “Please say what
this word is” and the neutral vowel of only the ShortVT, along with the
ambiguous target from the “da” to “ga” series (dg5) in the region surrounding
F3. (LTAS = Please; LPC = Neut, dg5).
3.6.3 Participants and Procedure
Seventeen subjects participated in this study. The stimuli were split into two
blocks based on the direction that the carrier phrase was played (forward or backward).
Participants ran in each block twice (order counterbalanced) and there were 5 repetitions
of each stimulus in each block. This made 400 trials total (2 phrases x 1 talker x 2
directions x 10 targets x 10 repetitions).
3.6.4 Results and Discussion
A 2x2 within-subjects ANOVA was used to analyze the data. The two factors
were phrase (“Please…,” neutral vowel) and direction (forward, backward). The main
effect of phrase was significant (F(1,16) = 6.21, p<.05), but the main effect of direction
was not (F(1,16) = .46, p>.05). The interaction was not significant (F(1,16) = .90,
p>.05). Planned comparisons showed that there was no significant difference between the
two phrases when they were played forward (F(1,16) = 3.48, p>.05), but there was a
significant difference when they were played backward (F(1,16) = 7.18, p<.05). In the
backward condition, the “Please…” phrase resulted in significantly more “da” responses.
See Figures 17 and 18.
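For a 2x2 within-subjects design like this one, each ANOVA F with one numerator degree of freedom equals the square of a paired t computed on per-subject contrast scores. A Python sketch of that equivalence with simulated data (the data, seed, and function names are illustrative, not the dissertation's):

```python
import numpy as np

def within_2x2_effects(cells):
    """cells: array of shape (n_subjects, 2, 2) of mean responses,
    dims = (phrase, direction). For two-level within-subject factors,
    each ANOVA F(1, n-1) equals the squared paired t on per-subject
    contrast scores."""
    def f_from_contrast(d):            # d: one contrast score per subject
        n = len(d)
        t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
        return t ** 2                  # F with df = (1, n-1)
    phrase = cells[:, 0, :].mean(1) - cells[:, 1, :].mean(1)
    direction = cells[:, :, 0].mean(1) - cells[:, :, 1].mean(1)
    interaction = (cells[:, 0, 0] - cells[:, 0, 1]) - (cells[:, 1, 0] - cells[:, 1, 1])
    return {name: f_from_contrast(d) for name, d in
            [("phrase", phrase), ("direction", direction),
             ("interaction", interaction)]}

rng = np.random.default_rng(0)
fake = rng.normal(50, 5, size=(17, 2, 2))   # 17 subjects, as in this study
effects = within_2x2_effects(fake)           # three F statistics
```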
The lack of a significant interaction means that the effect size of phrase was
statistically equivalent for the forward and backward presentations. This supports
Prediction #3 tested previously – that forward and backward stimuli have the same effect
on target perception. Preliminary Experiment 3 did not find this. As suggested, there
could be a temporal window over which the LTAS is calculated. If that is the case, it
would be expected that the acoustic information closest to the target for both the neutral
vowel and the “Please…” phrase is such that it would cause a shift in target perception
whether it was played forward or backward. The acoustic information in this temporal
window for the stimuli from Preliminary Experiment 3, however, would predict different
effects when the stimuli were played forward versus backward.

Figure 17. Results for Preliminary Experiment 4: Mean percent “da” responses for the
ShortVT synthesized forward versions of the neutral vowel and “Please say what this
word is.” Error bars indicate standard error of the mean.

Figure 18. Results for Preliminary Experiment 4: Mean percent “da” responses for the
ShortVT synthesized backward versions of the neutral vowel and “Please say what this
word is.” Error bars indicate standard error of the mean. Significance is marked for the
planned comparison.
While Prediction #3 was supported, Prediction #4 (that the neutral vowel would
elicit the same shift on target perception as a phrase produced by the same talker) was
not. A main effect of phrase was found with simple effects that revealed a significant
difference between the phrases when they were played backward. This suggests that
listeners may not be extracting the neutral vowel from a talker’s speech and then, using
this as a referent from which they base their other perceptual judgments. This prediction
will be investigated further.
3.7 General Discussion
The results of the preliminary experiments generally supported the LTAS model
as an explanation for effects of talker normalization. Changes in talker were capable of
shifting target perception (Prediction #1), even for consonant contrasts. Also, the
experiment showed that talkers created from the TubeTalker synthesizer were capable of
eliciting talker effects and so are acceptable stimuli for this paradigm.
The LTAS account also predicts that two carrier phrases by two different speakers
may not lead to differences in target perception, if their LTAS do not differentially
contrast with the target spectrum (Prediction #2). The results of Preliminary Experiment
2 supported this prediction: a substantial change in vocal tract (O+P- vs. O-P+) did not
lead to a shift in target perception. These results indicate that LTAS comparisons
between the carrier phrases and the ambiguous target member of a series are capable of
predicting when a change of talker will and will not affect the listeners’ subsequent
phoneme identification.
Additionally, the LTAS account presumes that the LTAS of the carrier phrase is
important for target categorization, not the phonemic content. This means that the same
stimulus played backward or forward should have the same effect, since the LTAS of
each would be equivalent (Prediction #3). To test this, forward and backward stimuli
were used in Preliminary Experiment 3 and Preliminary Experiment 4. The results of
Preliminary Experiment 3 did not support this prediction, while the results of Preliminary
Experiment 4 did. This led to the possibility that there is a temporal window (and/or
weighting mechanism) over which the LTAS is calculated. This would mean that the
information closest to the target would have the most impact on that target’s perception.
Investigating the possibility of a temporal window is the first specific aim of this work
(Chapter 4).
Another prediction of the LTAS account is that the LTAS of a carrier phrase is an
extraction of the neutral vowel and so both should have the same effect on target
perception (Prediction #4). The results of the forward condition of Preliminary
Experiment #4 supported this prediction. There was no difference in target perception
between the carrier phrase and the neutral vowel. This prediction must be developed
further through analyses of real productions to see if the LTAS of different phrases does
resemble the neutral vowel. This is the second specific aim of the research (Chapter 5).
Generally, it would seem that the LTAS model is successful at predicting when
carrier phrases will lead to perceptual shifts and when they will not. However, it must be
noted that these predictions have been coarse and solely based on visual comparisons
between the LTAS of different carrier phrases and the ambiguous target. The specific
circumstances that lead to a spectral contrast effect are unknown. It could be that the
peaks in energy only lead to an effect when they are within a certain frequency range of
one another. Or it is possible that the peaks are not as important as the troughs or lack of
energy in certain frequency regions. These questions need to be explored. Also, it is not
clear whether the backward effects (or lack thereof) have to do with temporal window
estimation or are evidence against the LTAS framework. In order to make quantitative
predictions with specificity and to determine exactly what the LTAS model can account
for, it is necessary to have a better model of what the LTAS representation is for the
listener. Therefore, I propose three specific aims based on the first three hypotheses of
the LTAS Model:
Specific Aim #1: Determine the temporal window over which the LTAS is calculated
Hypothesis #1: During carrier phrases, listeners compute a long‐term
average spectrum (LTAS) for the talker.
Specific Aim #2: Determine whether or not the LTAS reflects the neutral vocal tract.
Hypothesis #2: The LTAS resembles the output of the neutral VT, that
is, the average spectrum resembles the frequency response of the mean
VT shape.
Specific Aim #3: Determine what aspects of the comparison of the LTAS and the target
predict the shift in target categorization.
Hypothesis #3: Listeners represent target vowels relative to the LTAS
of the carrier.
CHAPTER 4
THE TEMPORAL WINDOW
Specific Aim #1: Determine the temporal window over which the LTAS is calculated
Hypothesis #1: During carrier phrases, listeners compute a long‐term average
spectrum (LTAS) for the talker.
4.1 Temporal Window
Preliminary Experiments 3 and 4 used the same stimulus sets played forward and
backward to determine whether listeners use the LTAS to make phonemic judgments on
ambiguous speech sounds, irrespective of the phonemic content of the carrier phrase.
Therefore, it was predicted that the same shift would be found in target perception,
regardless of whether the carrier phrase was played forward or backward. This prediction
was not borne out in Preliminary Experiment 3, suggesting that either the LTAS account
is fundamentally wrong or that the LTAS used for predictions is wrong because it
presumes an unweighted integration window that includes the entire carrier phrase. It
may be that the temporal window over which the LTAS is calculated is smaller than the
phrase or is weighted so that the information closest to the target is more important than
the previous spectral information. If this is the case, the backward and forward
conditions are not equivalent from the perspective of the LTAS, and so, differences
between the two conditions are likely. To develop a viable quantitative model of the
LTAS normalization in speech (and in other complex auditory perception), this
information on the temporal window is necessary. The following set of experiments was
designed to test for the possibility of a finite temporal window and to provide a rough
estimate of its duration if it exists.
4.2 Experiment 1
The purpose of this experiment is to investigate whether listeners only make use
of LTAS information from a specific time window, in which case only information close
to the target would have an effect on that target’s perception. This experiment used the
same stimuli from Preliminary Experiment 3 (the carrier phrase “He had a rabbit
and a” synthesized from both the LongVT and ShortVT, forward and backward).
Listeners were not only exposed to the full carrier phrase (replicating Preliminary
Experiment 3) but also heard shorter versions of the phrases. These shorter versions were
the 100-msec and 325-msec portions that were closest to the target in the full phrase
conditions. Note that these shorter durations do not have matching acoustic information.
Since the phrases were played forward and backward, the backward conditions actually
contained the acoustic information from the beginning of the phrase (See Figure 19).
Because this information local to the target differed, any temporal window for LTAS
computation that was shorter than the overall phrase length could result in differential
effects. In particular, the speech closest to the target in the backward condition was the
low-amplitude aspiration of the /h/ in “he.” If the temporal window is short (like the
100-msec backward stimulus used here), then it is unlikely that any differences in this
low amplitude sound caused by vocal tract length would be large enough to affect target
identification.
4.2.1 Methods
4.2.1.1 Stimuli and Predictions
Carrier Phrases. The four carrier phrases from Preliminary Experiment 3 were
used in this experiment (the Short and Long VT, backward and forward versions of “He
had a rabbit and a”) to create three conditions: full phrase, 325-msec phrase, and
100-msec phrase. The full condition contained the full carrier phrases (1.3 sec) in their
backward and forward forms. This condition was a replication of Preliminary Experiment
3. The two shorter-duration conditions contained stimuli made from both the backward
and forward full-phrase condition. In Adobe Audition (Balo, et al., 1992), the last 100
msec and last 325 msec portion of each phrase were digitally spliced from the original
phrase to make the new carrier phrases. This created twelve carrier phrases overall (2
VTs x 3 durations x 2 directions). Figure 19 shows a spectrogram and waveform of the
forward full LongVT carrier phrase indicating where the stimulus was cut in order to
create the shorter segments. Again, note that the backward and forward 100-msec and
325-msec phrases do not contain equivalent spectral information (i.e., the 100 msec
backward phrase will have different spectral information than the 100 msec forward
phrase), since the end of the backward stimuli contain the acoustics from the beginning of
the phrase.
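The splicing step can be sketched in Python with numpy arrays (the actual editing was done in Adobe Audition; the sample rate, function name, and stand-in waveform here are illustrative assumptions):

```python
import numpy as np

def splice_tail(phrase, fs, dur_ms):
    """Return the final dur_ms of a waveform, mimicking the digital
    splicing of the last 100- and 325-msec portions of each phrase."""
    return phrase[-int(fs * dur_ms / 1000):]

fs = 22050
full = np.random.randn(int(1.3 * fs))        # stand-in for the 1.3-s phrase
backward_full = full[::-1]                   # time-reversed version
short_fwd = splice_tail(full, fs, 100)       # tail of the forward phrase
short_bwd = splice_tail(backward_full, fs, 100)
# Note the two tails contain different material: the backward tail is the
# (reversed) beginning of the phrase, as emphasized in the text.
```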
Targets. The “bit” to “bet” target series was used for this experiment. The series
was shortened to eleven steps (the last stimulus in the series was cut – bVt12) to ensure
that the experiment would not take longer than one hour. The targets were appended to
each of the twelve carrier phrases after a 50-msec inter-stimulus interval. This created a
total of 132 stimuli (12 carrier phrases x 11 targets).

Figure 19. Waveform and spectrogram of the LongVT’s production of “He had a
rabbit and a.” The dashed vertical lines show the word boundaries. The
horizontal lines show the segments that were used as carrier phrases in either the
forwards or backwards condition for Experiment 1.
Predictions. Figure 20 shows the LTAS comparisons for all of the phrase pairs
from which these predictions were derived. As mentioned earlier, one of the purposes of
this research is to develop a model that can make more specific predictions about LTAS
effects. That being said, these predictions are still rather coarse. In the LTAS
comparisons, the region surrounding the target’s first formant is shown (0-1200 Hz) in
Figure 20. The ambiguous target (the fifth member of the “bit”-“bet” series) is
represented by a solid line. The phrases from the LongVT are represented by dotted lines
and those from the ShortVT are represented by dashed lines. If the LTAS of a carrier
phrase has both a trough (region of low amplitude) near the frequency of the target’s F1
and a gradual incline that peaks above or below the frequency of the target’s F1, the
carrier phrase is predicted to have a contrastive effect on target perception. In the case of
this ambiguous target, a perceived lowering of its F1 (due to a carrier phrase having a
peak higher in frequency) would result in more “bit” responses. A perceived raising of
its F1 (due to a carrier phrase having a peak lower in frequency) would result in more
“bet” responses.
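This coarse visual rule could be formalized, hypothetically, as a comparison between the carrier's main spectral peak in the F1 region and the target's F1 (the function, band limits, and Gaussian stand-in spectrum below are my illustration, not the dissertation's method):

```python
import numpy as np

def predict_shift(freqs, carrier_ltas, target_f1, lo=0, hi=1200):
    """Decision rule from the text: if the carrier's main spectral peak
    in the F1 region lies above the target's F1, the contrastive effect
    lowers the perceived F1 (more "bit"); if below, perceived F1 rises
    (more "bet")."""
    band = (freqs >= lo) & (freqs <= hi)
    peak_hz = freqs[band][np.argmax(carrier_ltas[band])]
    if peak_hz > target_f1:
        return "more 'bit' responses"   # perceived F1 pushed down
    elif peak_hz < target_f1:
        return "more 'bet' responses"   # perceived F1 pushed up
    return "no predicted shift"

freqs = np.linspace(0, 1200, 121)
ltas_a = np.exp(-((freqs - 800) ** 2) / 2e4)    # carrier peak near 800 Hz
print(predict_shift(freqs, ltas_a, target_f1=600))  # prints: more 'bit' responses
```

A fuller model would also have to encode the trough and plateau conditions discussed in the text, which this sketch ignores.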
From the LTAS comparisons, then, it would be predicted that both the forward
and the backward full-phrase condition would result in significantly more “bit” responses
for the ShortVT (Figure 20 – A). However, since the full-phrase condition is a
replication of Preliminary Experiment 3, it is expected that the backward effect will not
be significant. This null effect led to the idea that there may be a temporal window over
which the LTAS is calculated and that the local information (the part of the carrier phrase
closest to the target) might be responsible for the effect. The LTAS comparisons for
these shorter durations are shown in Figure 20 – B-E. From both of the short forward
conditions (325 msec and 100 msec), the same effect as the full-phrase would be
predicted – more “bit” responses for the ShortVT. The LTAS ShortVT for the backward
325-msec condition is somewhat atypical, as there is not a trough but a small plateau at
the frequencies where the peak of the target’s F1 is. It is unclear whether or not this
would have an effect on the target’s perception. The LTAS comparison for the backward
100-msec condition does not predict an effect, as the spectral energy for both of the
carrier phrases is distant from the target’s F1 (note that the y-axis shows a greater range
of relative amplitude than the other four images).

Figure 20. LTAS comparisons for the stimuli in Experiment 1 and Experiment
2. They are between the different durations of the carrier phrases “He had a
rabbit and a” of the LongVT and ShortVT and the ambiguous target from the
“bit” to “bet” series (bVt5) in the region surrounding F1. A-E are relevant for
Experiment 1. A-C are relevant for Experiment 2. The LTAS comparisons are
as follows: A) The Full Phrase; B) Forwards 325-msec segments; C) Forwards
100-msec segments; D) Backwards 325-msec segments; and E) Backwards
100-msec segments. Note that the axes are equivalent, except in the case of
comparison E. These 100-msec segments were much lower in amplitude, so the
amplitude range was adjusted to include both the 100-msec carriers and the
target. (Only the Full phrases in A use LTAS. The others are LPC.)
If one of these durations (either the 325-msec or the 100-msec) is the time
window over which the LTAS is calculated, then the effects should follow the same
pattern as the forward and backward full-phrase condition. (If Preliminary Experiment 3
is replicated, this means that the forward condition will show an effect and
the backward condition will not.) Expanding the window to include more of the carrier
phrase should also not change the pattern of results. For example, if 100-msec appears to
be the temporal window, then 325-msec should have the same effects on target
perception. This is because only the last 100-msec portion of the 325-msec would be
used in the calculation of the LTAS and this would match the acoustic information of the
100-msec stimuli played alone. Thus, the same effect would be expected in both cases.
4.2.1.2 Participants and Procedure
Thirteen participants took part in this study. The stimuli were split up into three
blocks based on duration (Block 1 = all 100-msec stimuli, Block 2 = all 325-msec
stimuli, Block 3 = all full-length stimuli) so that participants could take breaks, if needed.
Participants were run in each block twice (order counterbalanced) and each block
presented four repetitions of each stimulus for a total of 1,056 trials (1 phrase x 2 talkers
x 2 directions x 3 duration lengths x 11 targets x 8 repetitions).
4.2.2 Results
For the forward full-phrase condition, the ShortVT resulted in significantly more
“bit” responses than the LongVT (t(12) = -2.81, p<.05), but no significant difference was
found for the backward condition (t(12)=-.35, p>.05). The 325-msec condition showed a
significant difference for both the forward (t(12)=-4.63, p<.005) and backward (t(12)=-3.63, p<.005) stimuli (significantly more “bit” responses for the ShortVT in both). The
forward 100-msec condition also resulted in significantly more “bit” responses following
the ShortVT (t(12)=-3.19, p<.005), but there was no difference between the two vocal
tracts for the 100-msec backward carrier phrases (t(12)=-1.01, p>.05). Figure 21 shows
these results graphically.
4.2.3 Discussion
Replicating the findings of Preliminary Experiment 3, there was a significant
difference between the Long and Short VT when the full phrase was played forward but
not backward. For the 100-msec condition, the pattern was the same: the forward effect
was significant while the backward effect was not. On the other hand, in the 325-msec
condition, both forward and backward were significant. Neither of these durations
appears to approximate the temporal window over which the LTAS is calculated. If one
of them had been, then the stimuli with longer durations should have resulted in the same
pattern of effects.
There are certain limitations of these stimuli and the conclusions that can be drawn
from them, though. Specifically, one drawback of the 100-msec backward stimulus was
mentioned previously – that it consisted of the /h/ of “he.” To listeners, this sounded like
a very short noise burst and did not contain strong spectral energy (as seen by the LTAS
comparison – Figure 20 – E). It seems, then, that such a short stimulus, containing very
weak spectral information, would not change substantially as a function of talker and,
therefore, would not cause a shift in target perception. Experiment 2 tries to account for
this issue by using a voiced vowel for both the forward and backward 100-msec stimulus.

Figure 21. Results for Experiment 1: Mean percent “bit” responses for the full
phrase and shorter, spliced versions of “He had a rabbit and a.” Error bars
indicate standard error of the mean.
More generally, though, the results do follow the predictions made based on the
LTAS comparisons. Specifically, when the ShortVT carrier phrases had a trough in the
target’s F1 region that gradually increased in amplitude and peaked higher in frequency
than the F1, the carrier phrase resulted in more “bit” responses. The only LTAS
comparison that did not have a clear prediction was the backward 325-msec condition.
The ShortVT did not have a trough in the region of the target’s F1, but rather had a
plateau right in the region of the peak (see Figure 20) that then peaked higher in
frequency. The results showed that this plateau was enough to elicit a shift in perception,
as this condition resulted in significantly more “bit” responses for the ShortVT, as well.
Future predictions based on LTAS comparisons will take this plateau effect into
consideration.
4.3 Experiment 2
In Experiment 1, the effects of the local information to the target were compared
to each other. This provided an indication of the effectiveness of the local information in
the backward and forward conditions of Preliminary Experiment 3. However, this
necessitated that the LTAS of the conditions were all different. In Experiment 2, the
LTAS is maintained by reversing the same portion of the signal in the smaller time
windows. Windows of 100 and 325 msec were again used.
4.3.1 Methods
4.3.1.1 Stimuli and Predictions
Carrier Phrases. Figure 22 shows a spectrogram and waveform of the forward
full LongVT carrier phrase with marks where the stimulus was cut to create the shorter
segments. The stimuli were created in the same way as those in Experiment 1. The only
difference was that the backward shorter-duration stimuli were now reversed versions of
the forward shorter-duration stimuli.
Targets. The eleven-step “bit” to “bet” target series was used again. The targets
were appended to each of the carrier phrases after a 50-msec inter-stimulus interval. This
created a total of 132 stimuli (12 carrier phrases x 11 targets).
Predictions. Figure 20 (A-C) shows the LTAS comparisons relevant to this
experiment (full-length, forward 325 msec and forward 100 msec). Since the stimuli
used are from Experiment 1 and the forward and backward conditions now each contain
equivalent spectral information, only one LTAS comparison is needed for each duration.
Based solely on the LTAS comparisons, the predictions are the same as in Experiment 1:
Listeners will hear “bit” more often following the ShortVT than when the targets follow
the LongVT. This is true across all conditions, since each LTAS comparison shows that
the ShortVT had a trough in the target’s F1 region that gradually increased and peaked
higher in frequency. The carrier phrase’s higher peak in frequency would cause the
target’s F1 to be perceived lower in frequency and more “bit”-like. From the results of
the previous experiments, however, it is expected that the full-phrase condition will only
result in a significant shift when it is played forward. The 100-msec and 325-msec
conditions, however, should result in significant effects when played both forward and
backward.
4.3.1.2 Participants and Procedure
Twelve participants took part in this study. The procedure is the same as in
Experiment 1. The stimuli were split into three blocks based on duration (Block 1 =
all 100-msec stimuli, Block 2 = all 325-msec stimuli, Block 3 = all full-length stimuli) so
that participants could take breaks, if needed. Participants were run in each block twice
(order counterbalanced) and each block presented four repetitions of each stimulus for a
total of 1,056 trials (1 phrase x 2 talkers x 2 directions x 3 duration lengths x 11 targets x
8 repetitions).

Figure 22. Waveform and spectrogram of the LongVT’s production of “He had a
rabbit and a.” The vertical dashed lines show the word boundaries. The horizontal
lines show the segments that were used as carrier phrases in either the forwards or
backwards condition for Experiment 2.
4.3.2 Results
Again, replicating the results for the full-length condition, “bit” was heard
significantly more often following the ShortVT than when it followed the LongVT when
the stimuli were played forward (t(11)=-3.13, p<.05), but no significant difference was
found between the two when the stimuli were played backward (t(11)=-.56, p>.05). For
both the 325-msec forward and backward, “bit” was also heard significantly more often
following the ShortVT than when it followed the LongVT (forward: t(11)=-5.11, p<.005;
backward: t(11)=-2.66, p<.05). Neither of the 100-msec conditions showed a significant
difference between the two TubeTalker configurations (forward: t(11)=-1.65, p>.05;
backward: t(11)=-.42, p>.05). Figure 23 shows these results graphically.
Since the significant effect found in the 325-msec backward condition appeared to
be smaller than that in the forward condition, this was tested with a 2x2 within-subjects
ANOVA. The first factor was vocal tract (ShortVT, LongVT) and the second factor was
direction (backward, forward). The main effect of vocal tract was significant (F(1,11) =
30.77, p<.005) and the main effect of direction was significant (F(1,11) = 9.76, p<.05).
The interaction was also significant (F(1,11) = 6.18, p<.05). Planned comparisons
revealed that significantly more “bit” responses followed the ShortVT than the LongVT
in both the forward (F(1,11) = 26.10, p<.005) and the backward condition (F(1,11) =
7.05, p <.05).
Figure 23. Results for Experiment 2: Mean percent “bit” responses for the full
phrase and shorter, spliced versions of “He had a rabbit and a.” Error bars
indicate standard error of the mean. Significance is marked for the t-test analyses.
4.3.3 Discussion
While the full-length condition for the “He had a rabbit and a” carrier phrase has
proven to be a stable effect (now replicated three times) and the 325-msec effect was
found for the forward and backward condition as expected, the 100-msec effect was not
found for the forward or the backward condition. This is what is predicted from an
LTAS account: the same effect should be found, irrespective of whether a stimulus is
played forward or backward. The spectral content of the LTAS is the same and so the
effect should be the same. The effects found for the 325 and 100-msec conditions fit this,
because the results were the same when the stimuli were played forward and backward
(both significant for 325 msec, both not significant for 100 msec). However, the results
from the 100-msec condition did not replicate the effect from Experiment 1. In
Experiment 1, the 100-msec forward condition was significant. It is possible that 100
msec is too brief of a stimulus to show a consistent effect across participants. The LTAS
mechanism may need a stimulus of a longer duration to generate a stable LTAS by which
comparisons can be made. The results of the 325-msec condition suggest that by 325
msec such a stable LTAS has been calculated, but the interaction showed that the effect
was significantly smaller in the backward condition than in the forward condition. This
was also found in Preliminary Experiment 3. It may be that stimuli do not elicit strong
spectral contrast effects when the listener does not recognize them as speech. This
possibility is discussed in the General Discussion. Again, though, the results from the
325-msec condition do not match the pattern found for the full-phrase condition, so
neither duration seems to approximate that of the temporal window.
4.4 Experiment 3
The previous experiments provided a paradigm for examining the duration of an
integration window. If increasing the window does not change the effect size then one
can presume that the effective integration window has been reached. But it is more
difficult to use this paradigm to look at the question of whether the local information is
more heavily weighted in the computation of the LTAS. This experiment provides a
paradigm for mapping out the weighting of the integration window. Vowels produced by
the same talker (predicted to have different effects based on LTAS comparisons) can be
appended as the carrier phrase. If the local information dominates, then the obtained
shifts should be predictable from the vowel closest in time to the target, whereas if the
window uses a weighting function, shifts based on the local talker should be offset by the
addition of the other vowel(s) to the computation of the LTAS.
4.4.1 Methods
4.4.1.1 Stimuli and Predictions
Carrier Phrases. Eight different carrier phrases were created by appending
vowels from the LongVT to one another. The vowels /i/ and /ɑ/ for the LongVT were
chosen, because LTAS comparisons (see Figure 24) between these and the ambiguous
target from the “bit”-“bet” series had the most straightforward predictions. The LongVT
/ɑ/ has a peak in energy right above the first formant of the ambiguous target. This
would contrastively push the effective F1 of the ambiguous target down, leading to more
“bit” responses. The /i/ would have the opposite effect. Its peak is below the target’s F1,
so this would contrastively push the effective F1 up, causing more “bet” responses.
Since the previous two experiments did not result in a definitive duration for the
temporal window, the duration for each vowel was chosen to be 250 msec (in-between
both the 100-msec and 325-msec durations used previously). The TubeTalker was used
to synthesize these vowels and then in Adobe Audition (Balo, et al., 1992), the level of
the vowels was set to the same amplitude as the vowel portion of the ambiguous target
(bVt5). The vowels were then appended together with 15-msec of silence in-between
them. There was one short condition (only one vowel), two medium conditions
(Medium1: /i/ appended to /i/, /ɑ/ appended to /ɑ/; Medium 2: /i/ appended to /ɑ/, /ɑ/
appended to /i/), and one long condition (/i/-/i/-/ɑ/ and /ɑ/-/ɑ/-/i/). These created the eight
different carrier phrases.
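The construction of these vowel-sequence carriers can be sketched as follows (a hedged illustration: the sine-wave "vowels," sample rate, and RMS target stand in for the TubeTalker tokens and the Adobe Audition level-matching):

```python
import numpy as np

def build_carrier(vowels, fs, gap_ms=15, target_rms=0.1):
    """Append vowel tokens with a 15-msec silent gap, after scaling
    each to a common RMS level (the study matched levels to the
    vowel portion of the ambiguous target, bVt5)."""
    gap = np.zeros(int(fs * gap_ms / 1000))
    scaled = [v * (target_rms / np.sqrt(np.mean(v ** 2))) for v in vowels]
    out = scaled[0]
    for v in scaled[1:]:
        out = np.concatenate([out, gap, v])
    return out

fs = 22050
t = np.arange(int(0.25 * fs)) / fs           # 250-msec vowels
vowel_i = np.sin(2 * np.pi * 270 * t)        # crude /i/-like F1
vowel_a = np.sin(2 * np.pi * 700 * t)        # crude /ɑ/-like F1
long_carrier = build_carrier([vowel_i, vowel_i, vowel_a], fs)  # /i/-/i/-/ɑ/
```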
Targets. The original twelve-step “bit” to “bet” target series was used again. The
targets were appended to each of the carrier phrases after a 50-msec inter-stimulus
interval. This created a total of 96 stimuli (8 carrier phrases x 12 targets).
Predictions. If listeners use a weighted window, then the acoustic content closest
to the target will have the greatest effect on target perception. However, depending on
the acoustic content of the sounds preceding it, the effect size may change. The vowels
/i/ and /ɑ/ have opposite predicted effects on the target. So, for example, if the vowels
are played alone before the target, the /i/ should get more “bet” responses and the /ɑ/
should get more “bit” responses. If the vowels are played one after another, the vowel
closest to the target should have the most effect, but, because the other vowel does have
some impact, it will pull perception in the opposite direction and lessen that effect.

Figure 24. LTAS comparisons for the stimuli in Experiment 3. The images
show the vowel sequences of /i/ and /ɑ/ synthesized from the LongVT (LPC
used for all comparisons).

The specific predictions for each condition for both an unweighted and weighted window
are as follows:
Since there is only one vowel in the Short condition, it should follow the LTAS prediction stated above: /i/ should lead to more “bet” responses and /ɑ/ should lead to more “bit” responses. This should be the same for the Medium1 condition (/ii/ vs. /ɑɑ/),
since the vowels (and, thus, spectral content) are the same. For the Medium2 condition
(/iɑ/ vs. /ɑi/), the vowel at the end should still have the most impact on the target’s
perception. However, the other vowel (having the opposite predicted effect than the last
vowel) should decrease the overall size of the effect when compared to the Short
condition. This should be similar to what would happen for the Long condition (/iiɑ/ vs.
/ɑɑi/). The last vowel should have the most effect if the window is weighted, but the
previous two vowels should diminish it, since they would pull a listener's perception in
the opposite direction.
If listeners do not weight the spectral information, then their responses should
always be based on the spectral average of the entire stimulus. The order of the vowels
should not matter. The Short and Medium1 conditions should still result in more “bit”
responses for the /ɑ/ or /ɑɑ/ vowels and more “bet” responses for the /i/ or /ii/ vowels.
The Medium2 condition should not result in a difference between the two vowel sequences, since
/iɑ/ and /ɑi/ have the same spectral average. The Long condition should also result in no
shift between the two carriers, since they have such similar spectral content.
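The weighted/unweighted distinction can be made concrete with a toy sketch. The two-bin "spectra" and the exponential weighting function below are purely hypothetical (the shape of any weighting function is exactly what the experiment aims to determine); the point is only that an unweighted average cannot distinguish /iɑ/ from /ɑi/, while a weighted one can.

```python
import numpy as np

def window_average(spectra, decay=None):
    """Average a sequence of short-term spectra (earliest frame first).

    decay=None -> unweighted: every frame counts equally.
    decay=d    -> weighted: frame k steps back from the target gets
                  weight d**k, so material nearest the target dominates.
    """
    spectra = np.asarray(spectra, dtype=float)
    n = len(spectra)
    if decay is None:
        w = np.ones(n)
    else:
        w = decay ** np.arange(n - 1, -1, -1)  # frame nearest target weighted 1
    w = w / w.sum()
    return w @ spectra

# Toy spectra standing in for /i/ (low-F1 energy) and /ɑ/ (high-F1 energy)
spec_i = np.array([1.0, 0.2])
spec_a = np.array([0.2, 1.0])

unw_ia = window_average([spec_i, spec_a])            # /i/-/ɑ/, unweighted
unw_ai = window_average([spec_a, spec_i])            # /ɑ/-/i/, unweighted
w_ia = window_average([spec_i, spec_a], decay=0.5)   # /i/-/ɑ/, weighted
w_ai = window_average([spec_a, spec_i], decay=0.5)   # /ɑ/-/i/, weighted
```

With equal weights the two orders give identical averages; with decaying weights the vowel adjacent to the target dominates, so the two orders diverge.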
4.4.1.2 Participants and Procedure
There were fifteen participants in this study. The stimuli were split into four
blocks based on the vowel sequences (Long, Medium1, Medium2, and Short).
Participants ran in each block once and there were eight repetitions per block for a total
of 768 trials (1 talker x 8 vowel sequences x 12 targets x 8 repetitions).
Figure 25. Results for Experiment 3: Mean percent “bit” responses for vowel
sequences synthesized from the LongVT. Error bars indicate standard error of
the mean.
4.4.2 Results
No significant differences were found between any of the paired vowel sequences: Long (/i/-/i/-/ɑ/ vs. /ɑ/-/ɑ/-/i/): t(14) = 1.28, p > .05; Medium1 (/i/-/i/ vs. /ɑ/-/ɑ/): t(14) = .24, p > .05; Medium2 (/i/-/ɑ/ vs. /ɑ/-/i/): t(14) = .63, p > .05; and Short (/i/ vs. /ɑ/): t(14) = .58, p > .05. Figure 25 displays these results graphically.
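The paired comparisons reported above can be reproduced with a standard paired-samples t-test. The sketch below uses simulated per-participant percentages, since the real response data are not included here; variable names are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-participant mean %"bit" responses (n = 15) for the two
# Long-condition carriers; real values would come from the experiment.
pct_iia = rng.normal(55, 10, 15)   # /i/-/i/-/ɑ/ carrier
pct_aai = rng.normal(55, 10, 15)   # /ɑ/-/ɑ/-/i/ carrier

# Paired-samples t-test: each participant contributes one score per carrier.
t, p = stats.ttest_rel(pct_iia, pct_aai)
df = len(pct_iia) - 1  # paired test: df = n - 1 = 14
```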
4.4.3 Discussion
Although the LTAS comparisons predicted null effects for some conditions, it was surprising that no shifts in perception were found in any of them. It may be that the stimuli were not of the correct type
or did not have the right acoustic characteristics to elicit an effect. Watkins (1991) and
Watkins and Makin (1996a; 1996b) have found evidence that certain types of carrier
phrases do not elicit strong perceptual effects. In particular, it may be that carrier phrases
need more spectro-temporal variation to elicit a spectral contrast effect. This idea will be
discussed at length below.
4.5 General Discussion
Initially, the LTAS model predicted that backward stimuli should have the same effect as the same stimuli played forward. This was tested in Preliminary Experiments 3 and 4, and the results were mixed: Preliminary Experiment 4 found that the effects were the same, but Preliminary Experiment 3 did not. These findings led to the
suggestion that the LTAS may be calculated over a specific time window and if this is the
case, the spectral information closest to the target would affect target perception. This
would provide an explanation for the differences between the previous backward and
forward conditions, since the local information (the acoustics closest to the target) would
change depending on whether the stimulus was played forward or in reverse. Further, it was necessary to discern whether the temporal window was weighted or unweighted – that is, to find out whether the acoustic information closest to the target had the most effect within the temporal window. The purpose of these three experiments was to determine
the temporal window over which the LTAS is calculated and then, to determine whether
the window had a weighted or unweighted function.
This proved to be more complex than originally thought. The first two
experiments used the full phrase (“He had a rabbit and a”) and two shorter, spliced
versions of that phrase (325 msec and 100 msec) synthesized from the LongVT and
ShortVT to try to determine the duration of the temporal window. In Experiment 1, the
shorter-duration backward phrases had different acoustic content and in Experiment 2,
they had matching acoustic content. The prediction was that if one of the shorter
durations was the temporal window, then extending the window for a longer amount of
time would not show a different pattern of effects than the short duration. This was not
found for either of these shorter durations, so the temporal window still remains
unknown.
The third experiment tried to determine if the temporal window used a weighting
function or not. The stimuli consisted of a sequence of vowels synthesized from the
LongVT (sequences varied in length from one vowel to three vowels). The vowels were
determined by LTAS comparisons that predicted that they would have different effects on
the target’s perception. Further, the vowel sequences were created so that there would be
clear predictions that could tease apart whether the temporal window was weighted or not.
Unfortunately, no effects were found across any of the conditions.
Based on these three experiments, it is still unclear whether there is a specific
temporal window and if so, whether there is a weighting mechanism attached to it. It is
possible that these stimuli did not have the right characteristics or acoustic structure to
initiate the LTAS mechanism. There could be some differences between the carrier
phrases (the reversed phrases and the vowel sequences) and the target that led to mixed or
null effects.
One of the important ideas behind the LTAS model is that it is a result
of a general auditory process. Part of this idea came from the work of Holt (2005; 2006)
who used a series of tones that either had a high-average frequency or low-average
frequency to shift listeners’ perception of a target. If tones could shift the perception of a
target phoneme, then clearly it would be expected that synthesized speech could, as well.
Even reversed speech would be expected to cause a shift in perception. It may not be intelligible, but it would still have spectral properties similar to those of speech. The preliminary
experiments showed that, indeed, synthesized speech (specifically from the TubeTalker)
could cause shifts in perception. Reversed speech, however, was not as successful. The
most consistent finding from the experiments so far is that the full phrase: “He had a
rabbit and a” does not show an effect of talker when it is played backward, even though
it does when it is played forward. The vowel sequences were not successful, either.
There are two possible explanations for why these stimuli may not have caused an effect:
(1) The LTAS mechanism does not calculate a spectral average for all sounds but only for sounds with certain properties; or (2) The LTAS mechanism calculates a spectral average for all sounds but, depending on whether or not a sound is identified as being similar to the target, it may or may not affect the target's perception.
This idea that a sound’s identification may separate it from the target comes from
work on stream segregation (Bregman, 1990). In the case of the reversed speech, even
though it is unintelligible, it still is well-formed and has spectro-temporal cues that could
bind it together so that the listener perceives it as a separate “object” and separates it from
the target. This might explain why when the full “He had a rabbit and a” phrase is
played to listeners in reverse, it does not have an effect on target perception. It is bound
together and separated from the target and so has no perceptual effect on it. Watkins
(1991) ran a series of studies to investigate spectral contrast effects and the types of
carrier phrases that can elicit these effects, which included playing stimuli in reverse.
Unlike the results of Experiments 1 and 2, he found an effect of reversed speech on vowel perception (an “itch” to “etch” series), which suggests that reversed speech may not be segregated from the target (and there were effects found for the 325-msec stimuli in
Experiment 1 and 2). His carrier phrase was 1 second in length, which is very similar to
the duration of the carrier phrase used in Experiments 1 and 2 (1.3 sec), so the difference cannot be due to the durations (one perhaps providing a more or less stable LTAS than the other by giving the mechanism more time to calculate it).
However, he did not find an effect for noise carrier phrases that were filtered to
have the same spectral envelope (or LTAS) as his speech carrier phrase or for repetitions
of the same vowel sound. This led Watkins (1991) to suggest that a carrier phrase must
be more similar to running speech and have a “time-varying” spectral envelope, rather
than a “time-stationary” spectral envelope, to cause a shift in target perception. This may
explain why the vowel sequences did not elicit an effect; however, Watkins’ vowel
sequences were four repetitions of the same vowel. The stimuli used in Experiment 3
changed between two vowel sounds, although within each vowel the spectral content was
static. Even if this explains the lack of effects for Experiment 3, Watkins’ results still do
not clear up why there was no significant result for the backward full phrase condition.
This still leaves the possibility open that there is a time window for the LTAS
mechanism. Watkins’ carrier phrase could have just had the right spectral content in that
time window to elicit a backward effect. It seems that the question of whether the LTAS
is calculated over a specific duration is still undetermined and needs to be investigated
further.
CHAPTER 5
EXTRACTING THE NEUTRAL VOWEL
Specific Aim #2: Determine whether or not the LTAS reflects the neutral vocal tract.
Hypothesis #2: The LTAS resembles the output of the neutral VT, that is, the
average spectrum resembles the frequency response of the mean VT shape.
5.1 Extracting the Neutral Vocal Tract
Story and Titze (1998) and Story (2005a) acquired MRI images of the vocal tract
airspace shape of ten sustained vowels for one male talker. The cross-sectional area for
each of the vowel shapes was analyzed, using principal components analysis (PCA).
PCA begins by subtracting the mean (in this case, the average cross-sectional area) and then determines components or “modes” that capture most of the variance of the changes around the mean. In his analysis, Story found that two modes accounted for over 90%
of the variance across the ten vowels. Remarkably, these same two modes account for most of the variance for other talkers as well (Story, 2005b). Thus, the main individual differences in
vocal tract shape are captured in the mean shape (the average across different vowel
shapes), which Story refers to as the “neutral vocal tract.”
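The analysis described above (mean subtraction followed by PCA over cross-sectional area functions) can be sketched as follows. The area functions here are synthetic stand-ins built from two known modes, not the MRI data, and the 44-section tract length and helper name are assumptions for illustration.

```python
import numpy as np

def vt_modes(area_functions, n_modes=2):
    """PCA of vocal-tract area functions (rows = vowels, cols = sections).

    Subtracts the mean area function (the "neutral vocal tract") and
    returns the mean, the leading modes, and the variance they explain.
    """
    A = np.asarray(area_functions, dtype=float)
    mean_shape = A.mean(axis=0)
    # SVD of the mean-subtracted data gives the principal components
    U, s, Vt = np.linalg.svd(A - mean_shape, full_matrices=False)
    var = s ** 2 / np.sum(s ** 2)
    return mean_shape, Vt[:n_modes], var[:n_modes]

# Toy data: 10 "vowels" x 44 tract sections built from two underlying
# modes plus a little noise (stand-ins for the MRI area functions).
rng = np.random.default_rng(1)
x = np.linspace(0, np.pi, 44)
mode1, mode2 = np.sin(x), np.cos(2 * x)
coefs = rng.normal(0, 1, (10, 2))
areas = 3.0 + coefs @ np.vstack([mode1, mode2]) + rng.normal(0, 0.05, (10, 44))

mean_shape, modes, var = vt_modes(areas)
print(f"first two modes explain {100 * var.sum():.1f}% of the variance")
```

Because the toy data are generated from exactly two modes, the first two components recover nearly all of the variance, mirroring Story's finding for real vowel shapes.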
From a talker normalization standpoint, the finding that the neutral vocal tract
shape is the main difference between talkers provides a possible answer to how listeners
can accommodate for the variance between speakers. If listeners could estimate the
neutral as a referent, then they could “neutralize” its effect on subsequent productions.
One possibility is that the average spectral energy (LTAS) that a listener calculates when
he/she hears someone speak actually reflects the output of a neutral vocal tract. That is,
the average of the acoustic output of an individual over time may be similar to the
spectral content of his/her neutral vowel. If a listener can extract the neutral vowel from
a talker’s output, the listener may be able to use it as a referent from which he/she can
base future phonemic judgments. There are three specific predictions that can be derived
from this hypothesis:
(1) If the neutral vowel is extracted through the additive spectral content of a
specific talker’s speech productions, then the LTAS of running speech should
resemble the spectrum of the neutral vowel.
(2) The more the talker moves around the articulatory space, the more likely the
LTAS will come to resemble the neutral vowel.
(3) If the neutral vowel is extracted, then, it should lead to the same quantitative
shift in target responses as any phrase from that talker.
These predictions are examined in this series of experiments.
5.2 Experiment 4
This experiment tests the first prediction: If the neutral vowel is extracted through
the additive spectral content of a specific talker’s speech productions, then the LTAS of
running speech should resemble the spectrum of the neutral vowel. The “running
speech” for this study consisted of productions from four of the talkers (two males and
two females) from Story (2005b). Particularly, these talkers were used since the shape of
their neutral vocal tract had been derived from analyses of the MRI images. A
synthesized version of each talker’s neutral vowel was created from these known neutral
shapes and compared to the average spectral content of the running speech. It was
predicted that the spectral average of the running speech should resemble the spectral
average of the neutral vowel. Further, the addition of more running speech should
provide more spectral information for that talker and, thus, improve the resemblance
between the running speech’s and neutral vowel’s spectrum.
5.2.1 Methods
5.2.1.1 Signals and Predictions
Signals. The running speech consisted of recordings of 11 sentences from four
speakers (2 females, 2 males). These four talkers were SF1, SF2, SM1, and SM2 from
Story (2005b) (referred to as F1, F2, M1, and M2 in this paper; F=female, M=male).
These recordings came from one of the practice sessions that the participants went
through before MRI images were obtained from them. The practice sessions were meant
to simulate what it would be like in the scanner, so the participants were lying down and
wore earplugs during the recordings. The sentences recorded were not picked
specifically for this project but for the author's purposes at that time. While these sentences are not phonetically balanced, they do contain a wide range of vowels, including the point vowels /i/, /ɑ/, and /u/. A list of the sentences can be seen in Table 2.
To create the longest segment of running speech, all eleven sentences were
appended to one another in a random order for each subject (the order is also listed in
Table 2). To test whether the addition of more speech samples from a particular talker
 #   Sentence                                                  F1   F2   M1   M2
 1   I need to catch the eight o’clock flight to New York.     1    5    3    7
 2   She danced like a swan, tall and graceful.                7    7    4    5
 3   You were away.                                            5    1   11    3
 4   That noise problem grows more annoying each day.          9    2    2    2
 5   We were in Iowa while you were away in Ohio.              4    4    1    8
 6   The heat in the desert can be unbearable.                 6    6   10    1
 7   They all know what I said.                                8    3    5    6
 8   Things in a row provide a sense of order.                10   11    7   10
 9   I’ll make sense of the problem in a moment.              11    8    6    4
10   We were away a year ago.                                  3    9    9   11
11   If I had that much cash, I’d buy the house.               2   10    8    9

Table 2. A list of the sentences recorded from the four participants and the order in which each of the sentences was appended for each talker to make the stimuli.
makes the LTAS more similar to the neutral, the appended sentences were spliced to
make stimuli that went from shorter to longer durations (the longest duration being
all 11 sentences). The sentences had a total of 88 words and the shorter to longer stimuli
were created by splicing after every 8 words. These segments were created for all of the
speakers up to the full duration of the entire stimulus. The segments used in the analyses were only the first, third, sixth, and eleventh (referred to as Sent1, Sent3, Sent6, and
Sent11). Table 3 shows the cumulative duration in seconds for the segments of each
talker.
         F1      F2      M1      M2      Avg.
Sent1    2.04    2.11    2.81    1.59    2.14
Sent3    6.43    6.50    7.57    6.53    6.78
Sent6   14.20   14.07   14.74   13.61   14.16
Sent11  27.42   25.05   26.23   25.58   26.07

Table 3. The duration (in seconds) of each stimulus used in the LTAS comparisons with the neutral vowel from each talker.
The neutral vowels were synthesized with the TubeTalker using the area function of each talker's neutral vocal tract shape, with the source fundamental frequency set to 210 Hz for the female talkers and 105 Hz for the male talkers. Each neutral vowel was one second in length.
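TubeTalker is an area-function synthesizer and is not reproduced here; as a rough stand-in, a generic source-filter sketch shows the idea of exciting a fixed vowel resonance pattern at 210 Hz or 105 Hz. The formant frequencies and bandwidths below are hypothetical schwa-like values, not the talkers' measured neutral resonances.

```python
import numpy as np
from scipy.signal import lfilter

FS = 44100  # sampling rate (Hz); assumed

def impulse_train(f0, dur, fs=FS):
    """Glottal-source stand-in: unit impulses every 1/f0 seconds."""
    n = int(dur * fs)
    src = np.zeros(n)
    src[::int(fs / f0)] = 1.0
    return src

def add_formant(x, freq, bw, fs=FS):
    """Filter x through a single second-order formant resonator."""
    r = np.exp(-np.pi * bw / fs)          # pole radius from bandwidth
    theta = 2 * np.pi * freq / fs         # pole angle from center frequency
    a = [1.0, -2 * r * np.cos(theta), r * r]
    return lfilter([1.0], a, x)

def synth_vowel(f0, formants, dur=1.0, fs=FS):
    """Cascade formant resonators over an impulse-train source."""
    y = impulse_train(f0, dur, fs)
    for freq, bw in formants:
        y = add_formant(y, freq, bw, fs)
    return y / np.max(np.abs(y))          # peak-normalize

# Hypothetical formant values for a schwa-like neutral vowel
neutral_female = synth_vowel(210, [(500, 80), (1500, 90), (2500, 120)])
neutral_male = synth_vowel(105, [(500, 80), (1500, 90), (2500, 120)])
```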
5.2.1.2 Procedure
For each speaker, the LPC of the neutral vowel was compared to the LPC of the
different segments that went from only eight words to all 88 words. Figures 26 – 29
show the comparisons. Correlational analyses were done on each comparison to see if
the spectra of the neutral and the appended sentences had a similar pattern of amplitude
changes across frequencies and to see if the two were more highly correlated when the
appended sentences were of a longer duration. The neutral-only stimuli contained spectral information up to 5,000 Hz, so the analyses were calculated on a window from 0 to 4,000 Hz to avoid the drop in spectral energy that started in the 4,000–5,000 Hz region. Also, since the neutral vowel was best represented by the LPC spectrum (its duration was less than 1 second), the numbers used for the comparisons were extracted from the LPC of both the neutral and the different sentence lengths for consistency (sampled at 44,100 Hz).
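The comparison procedure (LPC spectral envelopes, restricted to 0–4,000 Hz, correlated across frequency) might be sketched as below. The LPC order, the number of frequency points, and the function names are assumptions; only the general method (autocorrelation LPC plus Pearson correlation of the envelopes) follows the text.

```python
import numpy as np
from scipy.signal import freqz

FS = 44100  # sampling rate (Hz), as stated in the text

def lpc(x, order):
    """LPC coefficients via the autocorrelation method (Levinson-Durbin)."""
    r = np.correlate(x, x, 'full')[len(x) - 1:][: order + 1]
    a, e = np.array([1.0]), r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:] @ r[i - 1:0:-1]) / e   # reflection coefficient
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]
        e *= 1.0 - k * k
    return a

def lpc_envelope_db(x, order=40, fs=FS, fmax=4000.0, npts=1024):
    """Smoothed LPC spectral envelope (in dB) over 0..fmax Hz."""
    a = lpc(x * np.hamming(len(x)), order)
    f, h = freqz([1.0], a, worN=npts, fs=fs)
    keep = f <= fmax  # restrict to 0-4,000 Hz, as in the analyses above
    return f[keep], 20.0 * np.log10(np.abs(h[keep]) + 1e-12)

def spectrum_correlation(x, y, **kw):
    """Pearson correlation between the LPC envelopes of two signals."""
    _, ex = lpc_envelope_db(x, **kw)
    _, ey = lpc_envelope_db(y, **kw)
    return np.corrcoef(ex, ey)[0, 1]
```

A signal correlated with itself returns 1.0; dissimilar envelopes yield lower (or negative) values, as in Table 4.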
5.2.2 Results
The results from the correlational analyses are found in Table 4.

Neutral vs.   Sent1   Sent3   Sent6   Sent11
Female1        .23     .62     .61     .68
Female2        .68     .59     .58     .56
Male1          .53     .62     .60     .62
Male2          .31     .36     .34     .33

Table 4. Results of the correlational analyses between the different sentence lengths of the stimuli and the neutral vowel for each speaker.
Figure 26. LPC comparisons between Female1's neutral vowel and various sentence segments.

Figure 27. LPC comparisons between Female2's neutral vowel and various sentence segments.

Figure 28. LPC comparisons between Male1's neutral vowel and various sentence segments.

Figure 29. LPC comparisons between Male2's neutral vowel and various sentence segments.
5.2.3 Discussion
The prediction was that the longer the duration of the appended sentences, the more closely their spectrum would correlate with that of the neutral vowel; the correlations should thus grow stronger for each longer segment. However, the pattern of results across the talkers does not support this prediction. Female1 follows the expected pattern most closely: her weakest correlation is for Sent1 and her strongest is for Sent11, although her Sent3 and Sent6 correlations are almost the same. Female2 shows the exact opposite pattern from what was expected: her correlations get weaker as the duration of the stimulus gets longer. Male1's weakest correlation is for Sent1, but his correlations plateau after that, and Male2 has the weakest correlations with the neutral out of all of the talkers; his correlations are also the most similar across segment durations. The weakness of these correlations does not provide a strong case for the prediction that the neutral vowel is extracted from running speech.
However, while these correlations are weak, it does appear that the frequencies of the peak locations for the neutrals and the running speech line up well in some cases, especially for Male1. It may be that correlations are not a sensitive enough measure to determine how similar or dissimilar the neutral vowel and running speech are to each other.
Also, these analyses included all of the speech sounds and pauses. The speech sounds
that create noise (like /s/ and /ʃ/) can lessen the prominence of the peaks and the silent
pauses would also lower the average of the LTAS. It may be that these types of sounds
and silences should be taken out before comparing the running speech to the neutral
vowel.
5.3 Experiment 5
Experiment 4 tested whether or not running speech could provide an estimate of
the neutral vowel. The stimulus set used in Experiment 4, though, was not phonetically
balanced and so may not have provided a large enough or equally distributed sample of
the possible productions from a talker. This experiment uses more controlled stimuli
synthesized from the LongVT and ShortVT to test Prediction #2: The more the talker
moves around the articulatory space, the more likely the LTAS will resemble the neutral
vowel. From this, one could predict that the best case for determining the neutral vowel
would be a sampling of the entire vowel space. To get this sampling, the stimuli used in
this experiment were vowel trajectories synthesized from the Long and Short VT. The
vowel trajectories moved from /i/ to /ɑ/ to /u/ and were compared to the spectral average
of the neutrals synthesized from the same TubeTalkers. To determine whether the entire
vowel space is necessary for listeners to extract the neutral vowel, LTAS comparisons
were not only made between the full trajectory and the neutral vowel but also at shorter
time segments. If the entire sampling is necessary, then the LTAS comparisons should
show that the longer the segment (i.e., the more vowels sampled), the closer the match
between the LTAS of the neutral and of the vowel trajectory.
There are two different ways in which the LTAS comparisons of the neutral and
the vowel trajectory could be similar. The first is that the spectra of the neutral and the
vowel trajectory could have a similar overall pattern of amplitudes across frequencies
(corresponding peaks and valleys). To determine this, correlations between the spectra
for the neutral and the vowel trajectory were done to provide an index of whether this
overall pattern was similar. The second possibility is that the amount of spectral energy
between the two is similar. To investigate this, RMS differences were calculated to
provide an overall “distance” measure between the two spectra. If a full sampling of the
vowel space is necessary for the average to resemble the neutral vowel, then the more
vowels available, the higher the correlation should be and/or the lower the RMS
difference.
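The two similarity measures described above can be computed side by side. A minimal sketch with toy dB envelopes (the offset example shows why both measures are needed: identical peak/valley shape gives a perfect correlation even when the overall levels differ):

```python
import numpy as np

def compare_spectra(env_a, env_b):
    """Two similarity measures between spectral envelopes (both in dB):
    Pearson correlation (shape match) and RMS difference (level distance).
    """
    env_a, env_b = np.asarray(env_a, float), np.asarray(env_b, float)
    corr = np.corrcoef(env_a, env_b)[0, 1]
    rms_diff = np.sqrt(np.mean((env_a - env_b) ** 2))
    return corr, rms_diff

# Toy envelopes: b has the same peak/valley pattern as a but sits 6 dB
# lower, so the correlation is perfect while the RMS difference is 6.
a = np.array([10.0, 30.0, 20.0, 40.0, 15.0])
b = a - 6.0
corr, rms_diff = compare_spectra(a, b)
```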
5.3.1 Methods
5.3.1.1. Signals and Predictions
Signals. The signals used in the LTAS comparisons were the neutral vowel and
vowel trajectory synthesized from the LongVT and ShortVT. The neutral vowels were
1.1 sec in length (matching the duration of the vowel trajectories). The vowel trajectories
were created to move from the vowel /i/ to /ɑ/ to /u/. Essentially, the vowel trajectory
was the acoustic output of the TubeTalker constantly changing shape through the
different mode combinations required to make the vowel sounds in the trajectory. The
trajectory of the Long and Short VT were spliced at different time intervals from the
beginning of the stimulus to create four different trajectories for comparison (see Figure
30). These time intervals were 100 msec, 300 msec, and 600 msec. Including the full vowel
trajectory (1.1 sec), this created four stimuli from the LongVT and ShortVT for
comparison with their respective neutral vowels.
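Slicing the trajectory at cumulative time points is a simple indexing operation; a sketch, with a zero array standing in for the synthesized 1.1-sec /i/-/ɑ/-/u/ trajectory (the sampling rate and function name are assumptions):

```python
import numpy as np

FS = 44100  # sampling rate (Hz); assumed

def cumulative_segments(signal, cut_points_sec, fs=FS):
    """Slice a waveform from its onset up to each cut point (in seconds)."""
    return [signal[: int(round(t * fs))] for t in cut_points_sec]

# Stand-in for the 1.1-sec vowel trajectory
trajectory = np.zeros(int(1.1 * FS))
segs = cumulative_segments(trajectory, [0.1, 0.3, 0.6, 1.1])
# durations: 100 msec, 300 msec, 600 msec, and the full 1.1 sec
```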
Figure 30. The vowel trajectory synthesized from the LongVT. The lines mark the time points where the stimulus was spliced to create the stimuli for comparison to the LTAS of the neutral vowel.
Predictions. If the average of a talker’s acoustic output resembles that talker’s
neutral vowel, then the more samples of his/her articulations, the higher the possibility
that the average will reflect the neutral vowel. In particular, then, the more of the vowel
space that is sampled from the TubeTalker the higher the correlation and/or the lower the
RMS value. That is, the full vowel trajectory's LTAS should be more similar to the neutral's LTAS than the shorter segments' LTASs are, since it will be the average of the vocal tract shape moving through all of its possible articulations in the vowel space. Further, the spectrum of a talker's vowel sampling should be more likely to reflect that talker's own neutral than that of another talker. So, the spectrum of the LongVT's vowel trajectory should be more similar to its own neutral than to the neutral of the ShortVT, and vice versa.
5.3.1.2 Procedure
Due to the short nature of the stimuli, analyses of the LPC for the stimuli were
used in the calculations. Comparisons were made between each of the four vowel
trajectory segments and the neutral vowel for that TubeTalker (either the LongVT or
ShortVT). Comparisons were also made between the full vowel trajectory (1.1 sec) and
the neutral of the opposite vocal tract. The comparisons consisted of correlation analyses
and RMS difference calculations. All of the analyses focused on the first formant region
of the stimuli (0 – 1200 Hz) to see if the first spectral peak lined up for both the vowel
trajectory and neutral. This region was chosen since it contained the frequency region
considered important for spectral contrast effects with the “bit/bet” series. If listeners use
the LTAS as a referent for their perceptual judgments and this reflects the neutral vowel,
then the peak in this area should be similar. See Figures 31, 32, and 33 for graphical representations of the LPC comparisons.
5.3.2 Results
The correlation values and the RMS difference between the neutral and each
vowel trajectory segment are shown in Table 5.
LongVT neutral vs.    Corr.    RMS
100 ms                -.84     22.18
300 ms                -.17     14.09
600 ms                 .824     6.36
1.1 sec                .80      5.51
Short – 1.1 sec        .89      4.41

ShortVT neutral vs.   Corr.    RMS
100 ms                -.84     16.05
300 ms                 .163    10.04
600 ms                 .97      3.29
1.1 sec                .87      6.64
Long – 1.1 sec         .58      8.12

Table 5. The correlations and the RMS differences between the neutral and each vowel trajectory segment for each vocal tract (LongVT and ShortVT).
5.3.3 Discussion
The main prediction was that the spectral average of the acoustic output of a wide
range of articulations would be similar to the spectral average of the neutral vowel. To
test this, vowel trajectories that went from /i/ to /ɑ/ to /u/ were synthesized from the Long
and ShortVT. They were then spliced at different time points in the full stimulus (100
msec, 300 msec, 600 msec, and the full – 1.1 sec) to create four stimuli for each vocal
tract. Each of these was compared to the neutral vowel, and it was predicted that the
longer stimuli would be more similar to the neutral than the shorter stimuli, since these
would have more vowel content. It was also predicted that a speaker’s vowel trajectory
should be more closely matched to its neutral vowel than to another speaker’s neutral
vowel.
Two analyses were done to determine how closely the vowel trajectories
resembled the neutral vowel. The first was a correlation analysis to determine if the
pattern of amplitude change across frequencies (troughs and peaks) was similar between the two.

Figure 31. LPC comparisons of the neutral vowel and the vowel trajectory segments synthesized from the LongVT (Vtraj = vowel trajectory; Vtraj1 = 100 ms; Vtraj3 = 300 ms; Vtraj6 = 600 ms; Vtraj11 = 1.1 sec).

Figure 32. LPC comparisons of the neutral vowel and the vowel trajectory segments synthesized from the ShortVT (Vtraj = vowel trajectory; Vtraj1 = 100 ms; Vtraj3 = 300 ms; Vtraj6 = 600 ms; Vtraj11 = 1.1 sec).

Figure 33. LPC comparisons of the neutral vowel of one TubeTalker compared to the full vowel trajectory of the other TubeTalker (Vtraj = vowel trajectory; Vtraj11 = 1.1 sec).

The LongVT and ShortVT showed the same pattern of results. The correlations became more positive as the duration of the vowel trajectory increased, up to 600 msec. For both vocal tracts, the neutral vowel was highly correlated with the
600 msec portion of the vowel trajectory. This correlation fell slightly for the full vowel
trajectory. The same pattern was found for the differences between the two spectra
(RMS difference) for the ShortVT. The difference gets smaller as the vowel trajectory
gets longer up to 600 msec but then gets larger for the full vowel trajectory. On the other
hand, the RMS difference for the LongVT continually gets smaller and is smallest for the
1.1 sec, full vowel trajectory. From these synthesized stimuli, it does not look like a sampling of the vowel space is reflective of the neutral vowel. While the two spectra become more similar in both the pattern of spectral changes (correlation) and the difference between them (RMS) up to 600 msec, the additional 500 msec of acoustic information actually shows the spectra of the neutral and the vowel trajectory starting to move away from one another in all but one condition (LongVT – RMS), and even in this condition the RMS value improves by less than a point. Further, the analysis that compared the LongVT neutral to the ShortVT full vowel trajectory showed that the LongVT neutral was more highly correlated with, and had a smaller RMS difference from, the ShortVT trajectory than from any of its own synthesized vowels. This goes against the prediction that the average acoustic output of one talker is similar to that talker's own neutral vowel. The other comparison, between the ShortVT neutral and the LongVT full vowel trajectory, did not show a higher correlation with the LongVT trajectory than with its own full vowel trajectory.
Since the mean vocal tract shape is the average of vowel vocal tract shapes, it was
thought that the average of the acoustics of a speaker’s vowels would resemble the output
of the mean vocal tract shape – the neutral vowel. One explanation for these results is
that the vowel trajectories continuously changed formant values across a broad range of
formants. This would flatten the spectral average, making the spectral peaks less
prominent. These analyses also run into the same problems as the previous experiment.
It may be that correlations and RMS values are not sensitive enough to determine how
closely the two are matched. Additionally, only the region surrounding the first formant
was analyzed. The results may have been different had a wider frequency range been
investigated instead.
5.4 Experiment 6
If a sampling of the entire vowel space can give listeners an estimate of the
neutral vocal tract, then one would expect that the vowel trajectory or the neutral vowel
from the same speaker would result in the same shift in perception of a target series
(Prediction #3). This experiment uses the neutral vowels and the full vowel trajectories
from Experiment 5 to run a perceptual study that tests this prediction.
5.4.1 Methods
5.4.1.1 Stimuli and Predictions
Carrier Phrases. The neutral vowel stimuli and the full vowel trajectories for the
Long and Short VT from Experiment 5 were used as the carrier phrases for this
experiment.
Targets. An eight-step “bet” to “bat” target series from the StandardVT was used
for this experiment. It was created in the same way as the “bit” to “bet” series (equal
changes in articulatory steps). The LTAS comparisons did not predict strong shifts for these carrier phrases with the “bit” to “bet” series, but comparisons between the carrier phrases and the “bet” to “bat” series did predict a shift (see Figure 34).
The targets were appended to each of the carrier phrases after a 50-msec inter-stimulus
interval. This created a total of 32 stimuli (2 talkers x 2 carrier phrases x 8 targets).
Predictions. If listeners are extracting the neutral from the vowel trajectories,
then the same shift should be found between talkers (ShortVT and LongVT), irrespective
of whether listeners hear the vowel trajectory or neutral vowel as the carrier phrase. If
listeners do not extract the neutral vowel from the trajectory, then the results should be strictly dependent on the LTAS comparisons.

Figure 34. LPC comparisons of the ShortVT, LongVT, and ambiguous target with both the synthesized neutral vowel (A) and the vowel trajectory (B).

The LTAS comparisons (Figure 34) show
that there should be no difference in target perception between talkers when the vowel
trajectories are used as the carrier phrase. However, when the neutral vowels are the
carrier phrases, there should be significantly more “bet” responses for the ShortVT (“bet”
has a lower F1 than “bat” and the ShortVT’s peak is higher in frequency, which would
make the F1 seem lower and more “bet”-like).
5.4.1.2 Participants and Procedure
Fourteen participants were run in this study. The stimuli were split into two
blocks, separated by carrier phrase (Block 1 = Neutrals, Block 2 = Vowel Trajectories).
There were five repetitions per block and participants ran in each block twice (order
counterbalanced). This made a total of 320 trials (2 phrases x 2 talkers x 8 targets x 10
repetitions).
5.4.2 Results
The comparison between the neutral vowels for the LongVT and ShortVT showed
a significant difference between the two with ShortVT eliciting more “bet” responses
(t(13)=-4.62, p<.005). No significant difference was found between the LongVT and
ShortVT for the vowel trajectory carriers (t(13)=.35, p>.05). Figure 35 shows the results
graphically.
Figure 35. Results for Experiment 6: Mean percent “bet”
responses for neutral vowels and vowel trajectories synthesized
from the ShortVT and LongVT. Error bars indicate standard
error of the mean.
5.4.3 Discussion
The results of the perceptual study do not support the hypothesis that listeners are
extracting an average that, based on the LTAS paradigm, reflects the neutral vowel and
using it as a referent from which they make perceptual judgments. There was no
significant difference between the talkers when the vowel trajectories were used as carrier
phrases, but there was a significant difference between them when the neutral vowel was
used as the carrier phrase. Listeners’ ability to identify a sound seems to be based
directly on the LTAS itself rather than the extrapolation of a neutral vowel. It is not
essential to the model that the average of the spectral energy reflects the neutral vowel.
This experiment suggests that listeners may be solely using the average. However, since
the vowel trajectories do cover a broad range of formant values (as mentioned
previously), it could also be that these stimuli did not have prominent spectral peaks to
elicit an effect of talker.
Interestingly, there was a shift between the neutral vowels of the LongVT and
ShortVT. These neutral vowels were static in nature. Earlier, it was suggested that
stimuli that do not vary spectrally or temporally may not elicit spectral contrast effects.
These results run counter to that claim. These stimuli were longer than the vowels used
earlier (1.1 seconds vs. 400 msec in Preliminary Experiment 4 and 250 msec in
Experiment 3). It could be that listeners need more than brief exposure to a sound to
calculate its spectral average and then use it as a referent. This needs to be tested in
future experiments.
5.5 General Discussion
The aim of these three experiments was to determine whether or not the LTAS of
a talker reflects the output of that talker’s neutral vowel. This idea led to three specific
predictions:
(1) If the neutral vowel is extracted through the additive spectral content of a
specific talker’s speech productions, then the LTAS of running speech should
resemble the spectrum of the neutral vowel.
(2) The more the talker moves around the articulatory space, the more likely the
LTAS will come to resemble the neutral vowel.
(3) If the neutral vowel is extracted, then it should lead to the same quantitative
shift in target responses as the talker’s speech productions.
These predictions were not supported by the results of Experiments 4, 5, and 6.
Experiment 4 tested the first prediction. Real speech productions (sentences
appended to one another) were used to investigate if running speech reflected the neutral
vowel. It was expected that the more acoustic information available (the more segments
appended together), the more similar the LTAS would be to the neutral vowel. However,
the correlations were not very high. The largest increase in acoustic information was
from Sent6 to Sent11. Averaged across subjects, this added almost 12 seconds of speech
content; yet, this did not consistently improve the correlation between the neutral and the
speech segments. Even when the correlation did improve, the gain was small compared to
Sent6 (the biggest improvement from Sent6 to Sent11 was .07, for Female1). The weak
correlations may have been because the entire phrase was included in the analysis.
That is, “noisy” speech sounds and silent pauses were included in the LTAS
calculation, which would have flattened the spectrum. If these sounds and silent gaps
were taken out, the spectral peaks may have been more prominent and the correlation
between the running speech and the neutral vowel may have been stronger. It also could
be that other measures may be more sensitive to these types of comparisons and should
be used in future investigations.
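The reanalysis proposed here — computing the LTAS with silent gaps excluded so that pauses no longer flatten the spectrum — could be sketched as follows. This is a minimal sketch, not the analysis actually used: the 25-msec frame length, the use of averaged FFT magnitudes, and the −40 dB silence threshold are illustrative assumptions.

```python
import numpy as np

def ltas(signal, fs, frame_len=0.025, drop_silence=False, thresh_db=-40.0):
    """Long-term average spectrum: mean FFT magnitude over frames.

    If drop_silence is True, frames whose RMS falls more than
    thresh_db below the loudest frame are excluded, so silent
    pauses do not dilute the average (hypothetical threshold).
    """
    n = int(fs * frame_len)
    frames = np.array([signal[i:i + n]
                       for i in range(0, len(signal) - n + 1, n)])
    if drop_silence:
        rms = np.sqrt(np.mean(frames ** 2, axis=1))
        keep = 20 * np.log10(rms / (rms.max() + 1e-12) + 1e-12) > thresh_db
        frames = frames[keep]
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

# correlation between a running-speech LTAS and a neutral-vowel LTAS:
# r = np.corrcoef(ltas(speech, fs), ltas(neutral, fs))[0, 1]
```

On this sketch, dropping silent frames raises the prominence of the spectral peaks, which is the mechanism by which the correlations discussed above might have been strengthened.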
Experiment 5 used more controlled stimuli that provided a sample of the entire
vowel space for two TubeTalkers (Long and Short VT). The comparisons between these
“vowel trajectories” and the neutral for each TubeTalker were meant to test Prediction 2:
The more the talker moves around the articulatory space, the more likely the LTAS will
come to resemble the neutral vowel. The vowel trajectories were compared with the
neutral vowel for the vocal tracts. They were also spliced into segments that went from a
short duration or sample of the vowel space to the full vowel trajectory of the entire
vowel space. These also did not show a benefit of adding more acoustic information to
the LTAS, and had the same issues as Experiment 4. These measures may not have been
sensitive enough to the changes, and different stimuli may be needed in future work. The vowel
trajectories’ formants were constantly transitioning through a wide range of formant
values. This flattened the spectrum so that the peaks were not very prominent. This
would explain why the correlations were so weak and also why the vowel trajectories did
not elicit changes in target perception in Experiment 6.
Experiment 6 tested the third hypothesis: If the neutral vowel is extracted, then it
should lead to the same quantitative shift in target responses as the talker’s speech
production. This perceptual study did not support the idea of neutral extraction. If it had,
the perceptual differences of the target should have remained constant between talkers,
irrespective of whether it followed the neutral or vowel trajectory of the same VT. The
pattern of results did not follow this prediction. When the target followed the neutral
vowels, there were significantly more “bet” responses for the ShortVT. When the target
followed the vowel trajectories, there was no significant difference between the two.
Again, this is possibly due to the fact that the vowel trajectories had very weak spectral
peaks, which would not have caused a contrastive effect. This would be predicted by the
LTAS comparisons of the carrier phrases and the ambiguous target. That is, an LTAS
account that does not include the extraction of the neutral vowel but is based only on the
spectral average itself would have predicted that the neutral vowels would have elicited
an effect on target perception and the vowel trajectories would not have. On the other
hand, listeners could be extracting the neutral vowel or the neutral vocal tract shape, but
if they are, it is not through the simple spectral average that the LTAS model suggests.
Further investigation is necessary to tease these two ideas apart.
CHAPTER 6
HARMONIC COMPLEX EFFECTS
Specific Aim #3: Determine what changes in the spectral content of the LTAS elicit a
perceptual shift on the target.

Hypothesis #3: Listeners represent target vowels relative to the LTAS
of the carrier.
6.1 Harmonic Complex Effects
As mentioned previously, the predictions from the LTAS model have been rather
coarse. In general, they have been made by looking at the relative frequency position of
peaks in the LTAS and comparing them to a particular formant peak frequency in the
ambiguous vowel. Specifically, an effect has been predicted when two criteria are met.
These are: (1) that the spectral energy of the carrier phrase has a trough (low energy) in
the region of the formant of interest and (2) that the energy from the trough gradually
increases and peaks higher or lower in frequency than the formant of interest. For the
most part, these two criteria have been successful in predicting LTAS effects. However,
it is necessary to develop a model that can make specific, quantitative predictions that are
based on knowledge of the LTAS mechanism, rather than on “eye-balling” the LTAS
comparisons. For instance, it is not clear how close in frequency the peak of the carrier
phrase and the target’s formant need to be for the carrier phrase to have an effect on the
target’s perception. It could be that outside of a certain frequency range of the target’s
formant, the carrier no longer has an effect, even if it does have a peak higher or lower in
frequency. It is also unknown whether the peak has to have a certain amplitude to affect
the target. It could be that the peak needs to be at least equal to that of the target’s
formant peak. Further, the role of troughs, or regions of low amplitude, has not been
given much consideration. Up to this point, shifts in perception have been predicted solely
from peak location. It may be that troughs have an effect as well. To examine these
possibilities so that the LTAS model is capable of making more specific predictions,
these next experiments parametrically explore how previous spectral energy affects target
responses.
6.2 Experiment 7
This first experiment sought to determine how close a peak in frequency needs to
be to the formant of a target to have an effect on that target’s perception. Seven harmonic
complexes were created. Each complex had a different peak in frequency that moved in
100 Hz steps from 300 to 900 Hz. The ambiguous target from the “bit” to “bet” series
(bVt5) was used. The peak of the complexes was matched in amplitude to the peak of the
first formant of the ambiguous target. It was expected that the harmonic complexes with
peaks closest to the first formant would cause a shift in target perception, while the ones
that were further away would not. Again, these are all very coarse predictions with no
true standard for what counts as “further away” or “closer.” It was hoped that this
experiment would begin to refine these into true, definable measures.
6.2.1 Methods
6.2.1.1 Stimuli and Predictions
Carrier Phrases. Seven harmonic complexes were created in MATLAB. They
were 500 msec in length with a 50 msec amplitude ramp at the beginning and end of the
stimulus. Each harmonic was a multiple of 100 Hz and the complex was composed of 12
harmonics (100-1200 Hz) that had one peak in frequency. The peak in frequency
changed from 300 Hz to 900 Hz. The steepness of the peak was determined by
measuring the change in amplitude between the harmonics in the first formant (downslope) of the ambiguous target of the “bit”-“bet” series. The peak harmonic was matched
in amplitude to the amplitude of the first formant in the ambiguous target. The harmonics
immediately lower and higher in frequency were 10 dB lower and the rest of the
harmonics were 15 dB lower than the highest harmonic. Figure 36 shows the spectra for
three of the harmonic complexes (HC300, HC600, HC900) and the ambiguous target.
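The text states these complexes were created in MATLAB; their construction can be sketched in Python as follows. The 44.1 kHz sampling rate and linear ramp shape are assumptions not specified above, and the peak level is expressed relative to 0 dB rather than matched to the target's absolute amplitude.

```python
import numpy as np

def harmonic_complex(peak_hz, peak_db=0.0, fs=44100, dur=0.5, ramp=0.05):
    """Sum 12 harmonics of 100 Hz (100-1200 Hz) with one spectral peak.

    The peak harmonic sits at peak_db; its immediate neighbors are
    10 dB lower and all remaining harmonics are 15 dB lower, as in
    the stimulus description above.
    """
    t = np.arange(int(fs * dur)) / fs
    y = np.zeros_like(t)
    for h in range(100, 1300, 100):
        if h == peak_hz:
            level = peak_db
        elif abs(h - peak_hz) == 100:
            level = peak_db - 10.0
        else:
            level = peak_db - 15.0
        y += 10 ** (level / 20.0) * np.sin(2 * np.pi * h * t)
    # 50-msec onset/offset amplitude ramps
    n = int(fs * ramp)
    env = np.ones_like(y)
    env[:n] = np.linspace(0.0, 1.0, n)
    env[-n:] = np.linspace(1.0, 0.0, n)
    return y * env

hc600 = harmonic_complex(600)   # 500-msec complex with its peak at 600 Hz
```

Sweeping `peak_hz` from 300 to 900 in 100 Hz steps reproduces the seven carrier phrases described here.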
Targets. The “bit”-“bet” series (11 steps) were appended to the end of the carrier
phrases after a 50-msec ISI (77 stimuli total).
Predictions. If the peak in the LTAS needs to be within a certain frequency range
of a target’s formant to have an effect on perception, then peaks outside that range should
not have an effect. The harmonic complexes were created in an attempt to determine
what this range is. The harmonic complexes within this range should shift the target’s
perception. In the case of this particular ambiguous target (bVt5), the region around the
first formant is the region of interest, since the first formant is important for the
distinction between a “bit” and a “bet”.

Figure 36. The spectra of three of the harmonic complexes and the
ambiguous target.

If the harmonic complex peaks higher in
frequency, the first formant of the ambiguous target should be perceived as having a
lower frequency and thus, get more “bit” responses. If the harmonic complex peaks
lower in frequency, then the opposite should happen and it should get more “bet”
responses. The harmonic complexes outside the range should not have an effect on the
target.
6.2.1.2 Participants and Procedure
Twenty participants were run in this experiment. The stimuli were not split into
separate blocks, but were all presented in a single block. Each stimulus was repeated 3 times
per block and participants ran in this block four times for a total of 924 trials (7 harmonic
complexes x 11 targets x 12 repetitions).
6.2.2 Results
A one-way ANOVA was used to test for differences in percent “bit” responses at
seven different frequency peaks. The percentage of “bit” responses was determined in
the same way as previous experiments (averaged across repetitions of the five middle
members of the target series). The results were not significantly different across the
seven different frequencies (F(6,114)=1.41, p>.05). Figure 37 shows the results
graphically. Figure 38 shows the individual data, as well as the average for each
condition.
Figure 37. Results for Experiment 7: Mean percent “bit”
responses for seven different harmonic complexes. Error bars
indicate standard error of the mean.
6.2.3 Discussion
Unexpectedly, the ANOVA showed no significant differences between harmonic
complexes. Looking at Figure 38, it is apparent that there is a large amount of variation
between subjects. One possible reason for this variation could be that these stimuli were
sensitive to differences in listeners’ boundaries between the vowels /ɪ/ and /ɛ/. That is, the
ambiguous target is not always the same across listeners. For some, it could be an earlier
target in the series and for others, it could be one that is later. That means the same
harmonic complex could have the opposite effect on two listeners’ perception if their
ambiguous targets were not the same. This may not have affected the results of previous
experiments, since there were typically only two different carrier phrases being compared
– so only two frequency peaks were of concern when being compared to the target’s first
formant. This would have less of a chance of washing out any effects, because only the
listeners whose ambiguous target was near one of those peaks may have been affected.
In this case, there was a peak every 100 Hz, so differences between
listeners’ ambiguous targets could have caused any group-level effects to disappear. For this reason,
two more ANOVAs were run. The average of percent “bit” responses across all
conditions was found (grand mean) and the averages for each subject across all
conditions were determined. If the subject’s average was higher than the grand mean,
they were put into one group and if it was lower, they were put into another group. This
was thought to separate the participants by whether they had a lower or higher ambiguous
target. These separate ANOVAs did not find a significant interaction, either (above
grand mean: F(6,42)=1.94, p>.05 ; below grand mean: F(6,66) = .49, p>.05). The
individual differences seem like they are quite complex and cannot simply be accounted
for by dividing the participants into two groups.

Figure 38. Experiment 7 individual differences for each harmonic
complex.
Another possibility is that seven carrier phrases were, in some sense,
overwhelming to the participant. Although it is not known how the LTAS mechanism works,
these stimuli could have “tired” the auditory system by stimulating the same frequencies within a
certain region continuously for up to an hour. As mentioned previously in Chapter 4, the
auditory system could also have segregated the harmonic complexes, identifying them
as a separate auditory event from the target. If so, then the harmonic complexes
may not have been able to affect the target.
6.3 Experiment 8
The purpose of this experiment was to determine whether the amplitude of the peak had
an effect on target perception. It could be that a higher-amplitude peak has more of an effect
than a lower one. Harmonic complexes were again created. This time, there were five
peaks, but each peak had two possible amplitude levels. These were played before the
target series to see if the harmonic complexes with the higher amplitude had a larger
effect on the target’s perception.
6.3.1 Methods
6.3.1.1 Stimuli and Predictions
Carrier phrases. As in Experiment 7, the carrier phrases were harmonic
complexes that consisted of 12 harmonics. They were also 500 msec in length with a 50
msec amplitude ramp on the beginning and end of each stimulus. The peak varied in 100
Hz steps from 300 Hz to 700 Hz. There were two different peak amplitudes. The first
amplitude was the same as used in Experiment 7. The center or peak harmonic was
matched to the target. The harmonics immediately lower and higher in frequency were
set 10dB lower than that. The rest of the harmonics were 15dB lower than the center
harmonic. The second amplitude setting added 15dB to the center harmonic and the
harmonic immediately lower and higher in frequency. The rest of the harmonics stayed
the same. This created a total of 10 carrier phrases (5 peak frequencies x 2 amplitude
levels). Figure 39 shows the spectra of one of the harmonic complexes when it has its
matched peak and its high amplitude peak.
Targets. The “bit”-“bet” series were appended to the end of the carrier phrases
after a 50-msec ISI (110 stimuli total). Only the first eleven steps of the series were used
to ensure that the experiment took no longer than one hour.
Figure 39. Spectra of the ambiguous target and the harmonic
complex with a peak at 700 Hz in both the matched amplitude
condition (top – HC 700) and the high amplitude condition
(bottom – HC700-p15).
6.3.1.2 Participants and Procedure
Thirteen participants were run in this experiment. The stimuli were split into
blocks based on peak amplitude, so there were two blocks (Block 1 = Matched Peak,
Block 2 = High Peak). There were five repetitions per block and participants ran in each
block twice for a total of 1100 trials (5 peaks x 2 amplitudes x 11 targets x 10
repetitions).
6.3.2 Results
A 2x5 (2 amplitudes x 5 peak frequencies) ANOVA was used to analyze the data.
The main effects of amplitude and peak frequency were not significant (F(1,12) = .12,
p>.05; F(4,48) = 16.46, p>.05). No significant interaction was found (F(4,48) =.61,
p>.05). Figure 40 demonstrates the results graphically, and Figures 41 and 42 show
individual scores for each subject across conditions.
Figure 40. Results for
Experiment 8: Mean
percent “bit” responses for
five different harmonic
complexes at two different
amplitude levels. Error
bars indicate standard
error of the mean.
Figure 41. Experiment
8 individual differences
for the matched peak
condition.
Figure 42. Experiment
8 individual differences
for the high peak
condition.
6.3.3 Discussion
Again, no significant effect was found. Due to the individual variation, the scores
were again split by whether each subject’s average across conditions was above or below
the grand mean. This did not result in any significant effects, either (above grand mean:
F(4,28)=.82, p>.05; below grand mean: F(4,16)=1.18,p>.05). Possible reasons why there
was no effect found will be discussed in the General Discussion.
6.4 Experiment 9
Experiment 8 investigated whether the amplitude of the peak had an effect on
target perception. This experiment investigates whether the amplitude of the trough
(region of low spectral energy) affects target perception.
6.4.1 Methods
6.4.1.1 Stimuli and Predictions
Carrier phrases. These stimuli were created in the same way as those of the
previous two experiments (500 msec, 50 msec amplitude ramp). Instead of having peaks,
though, these had troughs whose lowest region varied in 100 Hz steps from 300 Hz to 700
Hz. There were two different trough amplitudes. The first amplitude was meant to
mirror the matched peak values. The three harmonics that created the trough had a lower
value than the rest of the harmonics (the center of the trough was 15 dB lower, and the
harmonics immediately higher and lower in frequency were 10 dB lower). The second
amplitude setting subtracted 15dB from the center harmonic and the harmonic
immediately lower and higher in frequency. The rest of the harmonics stayed the same.
This created a total of 10 carrier phrases (5 trough frequencies x 2 amplitude levels).
Figure 43 shows the spectra of a subset of the harmonic complexes.
Targets. The “bit”-“bet” series (11 steps) were appended to the end of the carrier
phrases after a 50-msec ISI (110 stimuli total).
Figure 43. Spectra of the ambiguous target and the harmonic
complex with a trough at 700 Hz in both the small trough (top –
HC700-m) and the large trough (bottom – HC700-m15)
conditions.
6.4.1.2 Participants and Procedure
Fourteen participants were run in this experiment. The stimuli were split into
blocks based on trough amplitude, so there were two blocks (Block 1 = Small Trough,
Block 2 = Large Trough). There were five repetitions per block and participants ran in
each block twice for a total of 1100 trials (5 peaks x 2 amplitudes x 11 targets x 10
repetitions).
6.4.2 Results
A 2x5 (2 amplitudes x 5 trough frequencies) ANOVA was used to analyze the data.
The main effects of amplitude and trough frequency were not significant (F(1,13) = .19,
p>.05; F(4,52) = 1.50, p>.05). No significant interaction was found (F(4,52)=2.41,
p>.05). Figure 44 shows this graphically. Figures 45 and 46 show the individual scores of
all participants across conditions.
Figure 44. Results for
Experiment 9: Mean
percent “bit” responses
for five different
harmonic complexes at
two different trough
amplitude levels. Error
bars indicate standard
error of the mean.
Figure 45. Experiment
9 individual differences
for the small trough
condition.
Figure 46. Experiment
9 individual differences
for the large trough
condition.
6.4.3 Discussion
No effect was found that provided clues as to how troughs may play a role in
target perception. The data were again split into two groups, and again, no significant
effect was found for either (above grand mean: F(4,32)=1.58, p>.05; below grand mean:
F(4,16)=1.08, p>.05). Reasons why these stimuli may have not elicited any effects are
discussed in the General Discussion.
6.5 General Discussion
The specific aim of these experiments was to determine what changes in spectral
content of the LTAS would elicit a perceptual shift in the target. It seemed appropriate to use
very simple stimuli (the harmonic complexes) to investigate this, as the peaks and troughs
could be easily manipulated relative to the target’s first formant. However, it seems that
these stimuli may have been too simple, as no effects were found in any of the three
experiments. As mentioned previously (General Discussion of Chapter 4), it may be that
since these precursors have a similar structure, they are grouped separately from the
target (therefore, having no effect on its perception). The process whereby listeners
determine what acoustic information belongs to which auditory event is known as stream
segregation (Bregman, 1990). This was thought to be a potential explanation for why the
reversed speech of the full phrase “He had a rabbit and a” did not have an effect on
target perception. There is also another possible explanation. Watkins (1991) suggested
that carrier phrases must include spectro-temporal variation to elicit a change in target
perception. These harmonic complexes did not vary, but maintained the same spectral
energy for the 500-msec duration. That could also provide an explanation for
the null effects found here; on the other hand, in Chapter 5 – Experiment 6, the
neutral vowel elicited a shift in perception as a function of talker, despite having no
spectro-temporal variation. It seems that the LTAS mechanism and what elicits effects
from it is rather complex. This will be discussed further in Chapter 7.
CHAPTER 7
DISCUSSION AND CONCLUSION
7.1 Overview Summary
Listeners have an amazing ability to understand a wide range of talkers, despite
all of their inherent differences. Talkers vary in their anatomy and physiology, in their
dialect/accent, and even in the speaking styles they may use in different social contexts.
All of these affect the acoustic output of the particular speaker, and for over fifty years,
researchers have tried to figure out how listeners seem to so easily handle this variability.
The process by which listeners compensate for these differences has been referred to as
“talker normalization.” Specifically, this project was concerned with extrinsic talker
normalization (See Chapter 1 for a review of the different theories of talker
normalization).
The paradigm with which these effects have been studied (first used by Ladefoged
and Broadbent, 1957) presents listeners with a series of carrier phrases followed by an
ambiguous target speech sound. The carrier phrases are usually manipulated to sound as
if they are being produced by different talkers. Listeners are asked to identify the target
speech sound at the end and in many cases, when the talker changes, listeners’ perception
shifts, as well. This is taken as evidence for talker normalization – that listeners’
perception is affected by differences between talkers. The previous models and theories
of extrinsic talker normalization have had one thing in common: the proposal that
listeners are learning something specific about the talker during the initial carrier phrase.
Whether it is that the listener is creating an acoustic-phonemic map or that the listener is
extracting anatomical/physiological information about a talker’s vocal tract and
articulatory patterns, these models presume that the listener is extracting fairly specific
information about the talker. However, these theories have not always been able to
account for the results across all experiments. Particularly, a substantial change in talker
has not always led to an effect on the target (Darwin, McKeown, and Kirby, 1989;
Dehcovitz, 1997b).
The purpose of this project was to develop and test an alternative model of
extrinsic talker normalization. This model was motivated by two previous findings. The
first was that listeners’ perception of a target speech sound could be shifted from one
phoneme to another, depending on whether a series of tones that preceded it had a high or
low average frequency (Holt, 2005; 2006). If the tone sequence had a higher average
than the frequency region important for that target speech sound’s identification
(typically, a formant), then the formant would be perceived as lower in frequency and be
identified as the speech sound with the lower formant (Lotto & Holt, 2006). The current
model extends this spectral contrast effect to try to account for the effects of carrier
phrases on target perception. The model suggested that listeners could be calculating an
average across the carrier phrase – in this case, a Long Term Average Spectrum or LTAS
– and using this as a referent for identification of ambiguous target sounds. If a peak in
the LTAS was close to the important frequency region (or formant) of a speech sound,
then it would be predicted to have a contrastive effect on that sound’s perception.
However, if none of the peaks in the LTAS were close to important frequency regions,
the carrier phrase would not have an effect on the target’s perception. By making LTAS
comparisons between the carrier phrase and the target, one could not only predict that the
carrier would have an effect but also the particular effect it would have.
The second finding that influenced the model was based on analyses of vocal tract
area functions that came from MRI images (Story, 2005b). These analyses showed that
across different talkers’ vowel productions, most of the variability between talkers could
be captured by their neutral vocal tract shape or average shape across all of the vowels.
Based on this result, it is tempting to speculate that listeners could accommodate
individual talker differences by estimating this neutral vocal tract shape. Such a referent
(a cross-sectional area of a vocal tract air space) would only be useful if one presumed a
model of talker normalization that was based on an internal vocal tract model (perhaps
through Analysis-by-Synthesis). Another possibility is that the acoustic output of a
neutral vocal tract – the neutral vowel – would also provide some normalization of talker
differences when used as a referent. The final theoretical leap is that the LTAS of a
talker’s speech may, in fact, reflect the spectral characteristics of the neutral vowel. This
would be because the speech produced would be the output of a variety of vocal tract
shapes and the neutral vowel is essentially the output of the average vocal tract shape. It
would seem plausible that the average of the output of a variety of vocal tract shapes
could resemble the neutral vowel eventually. The LTAS being calculated and used as a
referent, then, could be similar in spectral content to the neutral vowel.
Based on these two findings and their theoretical extensions, an LTAS model of
talker normalization was developed, which had four main hypotheses: 1) during carrier
phrases, listeners compute an LTAS for the talker; 2) this LTAS resembles the spectrum
of the neutral vowel; 3) listeners represent subsequent targets relative to this LTAS
referent; and 4) such a representation reduces talker-specific variability.
From the results of three preliminary experiments (Chapter 3), it was noted that
the predictions from the LTAS were not very precise. Predictions were the result of
rough comparisons of the LTAS and target spectra. It would be preferable to have a
model that could make precise, quantitative predictions. The preliminary studies did not
always support predictions of the basic LTAS model. Specifically, the pattern of
expected results was not obtained when the carrier phrases were played backward. This
led to the idea that the LTAS may be calculated over a specific time duration, not over an
entire phrase. Together, with the idea of the neutral vocal tract, this led to three specific
aims for the current project:

Specific Aim #1: Determine the temporal window over which the LTAS is
calculated. (Chapter 4 – Experiments 1, 2, 3)

Specific Aim #2: Determine whether or not the LTAS reflects the neutral vocal
tract. (Chapter 5 – Experiments 4, 5, 6)

Specific Aim #3: Determine what aspects of the comparisons of the LTAS and
the target predict shifts in target categorization. (Chapter 6 – Experiments 7, 8, 9)
Below, I give a short summary of the results of the experiments associated with each aim
followed by an interpretation of the results in light of the model. Future experiments are
proposed in each section based on questions raised by the results.
7.2 Specific Aim #1: Determine the temporal window over which the LTAS is
calculated.
Preliminary Experiment 3 found an effect of talker when stimuli were played
forward but not when they were played backward. This was a challenge for the LTAS
model, as it would predict that contexts with the same LTAS (like a signal played
forward or backward) should have the same effect on listeners’ perception of the target.
The lack of a backward effect led to the prediction that there may be a temporal window
over which the LTAS is calculated. That is, it may be that only some duration of the
acoustic information closest to the target will affect that target’s perception. As the
forward and backward contexts have different information near the target (the backward
actually having the acoustics from the beginning of the carrier phrase), this could explain
the discrepancy in the results. The purpose of the first two experiments, then, was to
determine whether a temporal window could explain the previous results using a gated
presentation. The third experiment used a complementary paradigm (adding information
instead of subtracting) to try to provide an estimate of the duration of such a window.
The first two experiments used the same stimuli from Preliminary Experiment 3
(carrier phrase “He had a rabbit and a” played both forward and backward). These
stimuli were spliced to create a stimulus set that had different durations (100 msec, 325
msec, and 1100 msec). Carrier phrases containing the same content in the critical
temporal window were expected to have the same effect regardless of the content outside
of the window. The results from these experiments, however, did not show a definitive
time window. Different patterns of effects were obtained for each gating duration. The
fact that significant effects for backward presentation were obtained for 325-msec
durations but not for the full phrase suggests that if there is a window of integration for
LTAS computation, it is longer than 325 msec.
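The gating manipulation used in these experiments can be sketched in a few lines. This is an illustrative reconstruction, not the actual stimulus-preparation script; the function name and parameters are placeholders:

```python
import numpy as np

def gate_context(carrier, fs, window_ms):
    """Keep only the final window_ms of a carrier phrase -- the portion
    nearest the target -- discarding all earlier acoustic material."""
    n = int(fs * window_ms / 1000)
    return carrier[-n:] if n < len(carrier) else carrier
```

With an 1100-msec carrier, for example, gating at 325 msec retains only the last 325 msec of signal preceding the target.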
Experiment 3 was designed to examine the effects of context duration on the
computation of the LTAS by appending stimuli with well-described spectral
characteristics (steady-state vowels). By measuring the effects of appending additional
vowels, it was proposed that one could determine the extent of the temporal window and
even perhaps the weighting of the information as one moves distally from the target.
However, none of the manipulations resulted in changes in target perception.
7.2.1 Interpretation and Future Research
The finding that the full phrase “He had a rabbit and a” produced an effect of
talker when played forward but not when played backward was consistent across
multiple experiments (Preliminary Experiment 3, Experiment 1, and
Experiment 2). This is a challenge to the simplest LTAS-based model, in which the
LTAS is computed across the entire context. Neither of the shorter gating durations
examined resulted in a pattern of effects that would indicate a temporal window for
LTAS computation. One possible interpretation is that one cannot obtain context effects
for reversed speech because it does not contain phonetic information. While this possibility
is discussed further in section 7.5, it is unlikely to be true in its strongest form, since backward
effects have been previously demonstrated (Watkins, 1991) and a backward effect was
obtained for the 325-msec gating window (and in Preliminary Study 4). Another
possibility is that the temporal window over which the LTAS is computed is longer than
325 msec but shorter than the overall phrase duration (1100 msec) and that within this
window the LTAS of the forward and backward contexts differ. Future studies will
explore longer temporal windows. However, a new phrase should be used to verify that
there is a difference between backward and forward stimuli. Further, because the phrase
used contains “bit” in the carrier (“rabbit”), this could be a problem for using this phrase
for tests of longer temporal windows. For example, splicing the phrase 750 msec from
the beginning results in: “He had a rab”. This would be a problem when the targets
consist of a “bit” to “bet” series. Lexical information could skew listeners to hear more
“bit” responses as this would create a word. Another phrase should then be used to avoid
lexical biases and to confirm that there are differences between forward and backward
phrases.
The lack of effect of the appended vowel contexts in Experiment 3 provides no
insight into the estimation of a temporal window, but it does demonstrate a constraint on
the normalization mechanism. It is clear that some contexts elicit no effects on target
perception despite having appropriate phonetic and spectral content. Suggestions on
what underlies these constraints are developed further in section 7.5.
7.3 Specific Aim #2: Determine whether or not the LTAS reflects the neutral vocal
tract.
The second specific aim was to determine whether or not the LTAS reflects the
neutral vowel. Experiment 4 compared the LTAS of running (real) speech to the
spectrum of the neutral vowel for four different talkers (two female, two male). It was
predicted that the more running speech that was included in the calculation, the more the
LTAS would resemble the neutral vowel. As this was the first investigation of this
proposal, the computations and comparisons were performed with the fewest assumptions
possible – the envelope of the LTAS computed across entire sentences was correlated
with the envelope of the spectrum of the neutral vowel. The resulting correlations were
not convincingly high. One could easily argue that using a correlational analysis was a
rudimentary way of investigating this idea and that other methods could better describe the
relationship between the running speech and the neutral vowel. Also, the correlations
compared the frequency region from 0-4000 Hz. If a smaller region had been analyzed
(for example, only the region that included the first three formants), the correlations may
have been much stronger.
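The comparison procedure just described can be sketched as follows. This is an illustrative reconstruction, not the analysis code actually used; the Welch-averaging window size is an assumed value:

```python
import numpy as np
from scipy.signal import welch

def ltas_db(signal, fs, nperseg=1024):
    """Long-term average spectrum: power spectra averaged over the
    whole signal, returned in dB."""
    freqs, psd = welch(signal, fs=fs, nperseg=nperseg)
    return freqs, 10 * np.log10(psd + 1e-12)

def envelope_correlation(sig_a, sig_b, fs, fmax=4000.0):
    """Pearson r between two LTAS envelopes, restricted to 0-fmax Hz."""
    f, a = ltas_db(sig_a, fs)
    _, b = ltas_db(sig_b, fs)
    band = f <= fmax
    return np.corrcoef(a[band], b[band])[0, 1]
```

Lowering `fmax` to cover only the first three formants would implement the narrower frequency-region comparison suggested above.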
Perhaps most importantly, the LTAS was computed across the entire sentence,
with no portions of the signal excluded. However, the running speech consisted of sentences that included an
array of different speech sounds, including fricatives and stops as well as silent gaps.
These types of sounds likely flatten the spectrum at lower frequencies near the first two
formants. It could be that if these sounds were taken out, the LTAS of the running
speech would include more distinct resonances that would match the neutral vowel.
In Experiment 5, synthesized stimuli from both the LongVT and the ShortVT
were used to test whether the acoustic output from a speaker could resemble that
speaker’s neutral vowel. The stimulus that was compared to the neutral vowel for each
TubeTalker was a vowel trajectory. The vowel trajectory was the output of vocal tract
movement from /i/ to /ɑ/ to /u/. It was predicted that including a greater sampling of the
vowel space would result in an LTAS that would come to resemble the neutral vowel.
The results of the LTAS comparisons between the vowel trajectory stimuli and the
neutral vowels did not show that the more of the articulatory space that was sampled, the
closer the output resembled the neutral. One possible reason for the failure to support
this prediction is that these signals contained continuous changes in formant values,
which flatten the LTAS representation. Indeed, the correlations
between LTAS and neutral were actually better when only half the vowel space was
sampled. When both halves were included in the computations, the correlations
decreased, in part, because the averaging across a broader range of formant frequencies
results in less distinct spectral peaks especially in the region of F2.
Rather than comparing the LTAS of different speakers, Experiment 6 compared
listeners’ perception of a target following either the neutral vowel or the full vowel
trajectory of either the LongVT or the ShortVT. If listeners were extracting the neutral
vowel, then the pattern of effects across talkers should have been the same whether the
carrier was the neutral vowel or the vowel trajectory. However, the pattern of results was not
the same. The neutral carrier phrase had a significant effect of talker. The vowel
trajectories did not. Again, this difference in perception could result because the
continuous vowel trajectories do not result in very distinct LTAS patterns.
7.3.1 Interpretation and Future Research
The results of Experiment 4 do not provide strong evidence that the LTAS of
running speech resembles the spectrum of the neutral vowel. However, this study
included comparisons with the fewest assumptions about how the LTAS is calculated. In
particular, the inclusion of silences (e.g. stop closures) and fricative and aspiration noise
in the LTAS likely results in a flattened LTAS in the region of interest for the studies
conducted (in the region of F1 and F2). An obvious next step is to examine the LTAS for
only voiced sections of the speech signal. This is likely to result in more distinct spectral
peaks that may be comparable to the resonances of the neutral vocal tract. It should be
noted that such a step has implications for the LTAS model of normalization. An
additional mechanism for detecting voicing in the signal will have to be proposed to limit
LTAS computation to just those segments of the signal.
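A voicing gate of the kind proposed could be as simple as an autocorrelation-based periodicity check. The sketch below is one hypothetical implementation; the frame sizes, the 60-400 Hz pitch range, and the 0.3 threshold are all assumed values, not parameters from this dissertation:

```python
import numpy as np

def voiced_only(signal, fs, frame_ms=25, hop_ms=10, thresh=0.3):
    """Concatenate frames judged voiced by a crude periodicity test:
    the normalized autocorrelation peak within the pitch lag range
    must exceed thresh."""
    n = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    lo, hi = int(fs / 400), int(fs / 60)   # lags from 400 Hz down to 60 Hz
    kept = []
    for start in range(0, len(signal) - n, hop):
        frame = signal[start:start + n] - signal[start:start + n].mean()
        energy = np.dot(frame, frame)
        if energy < 1e-8:
            continue                        # silent frame
        ac = np.correlate(frame, frame, mode="full")[n - 1:]
        if ac[lo:hi].max() / energy > thresh:
            kept.append(frame)
    return np.concatenate(kept) if kept else np.array([])
```

The voiced-only LTAS would then be computed over `voiced_only(signal, fs)` rather than the raw signal.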
In addition to testing different ways of calculating the LTAS from the signal, the
next steps in this project will include the development of new ways to test the match
between the LTAS and the neutral vowel. The correlation and RMS methods used in this
dissertation are simple and straightforward, but perhaps too coarse. Future measures
could include correlations and distances of estimated peak frequencies. Even with novel
metrics, it is unlikely that the LTAS and the neutral vowel will be identical for most
speech. An interesting question to follow up on is whether the LTAS provides significant
normalization if used as a referent. Vowel spaces can be plotted for multiple speakers
both in objective F1-F2 space and in a space with the frequencies of the first two major
peaks of the LTAS serving as the origin (reference point). If vowel category dispersion is
less in the LTAS-reference space, this would indicate some beneficial normalization even
if the neutral vowel itself is not extracted.
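The proposed dispersion test can be made concrete with a small sketch; all formant and peak values in the example are invented for illustration:

```python
import numpy as np

def dispersion(tokens):
    """Mean Euclidean distance of tokens from their centroid -- a simple
    measure of category spread across talkers."""
    pts = np.asarray(tokens, dtype=float)
    return float(np.linalg.norm(pts - pts.mean(axis=0), axis=1).mean())

def ltas_referenced(tokens, peaks):
    """Re-express each (F1, F2) token relative to that talker's first two
    LTAS peak frequencies (P1, P2)."""
    return [(f1 - p1, f2 - p2) for (f1, f2), (p1, p2) in zip(tokens, peaks)]
```

If, say, two talkers' /i/ tokens are (300, 2300) and (400, 2500) Hz and their LTAS peaks differ by the same offsets, the LTAS-referenced tokens coincide and category dispersion drops to zero, which is the signature of beneficial normalization described above.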
The results of Experiments 5 and 6 indicate that continuous excursions of the
vowel space may not be the best signal for estimating the neutral vowel, because spectral
peaks tend to be flattened by the averaging. Future studies will examine how the neutral
vowel compares to the LTAS across isolated productions of all of the vowels of English
or just the point vowels. It is likely that these vowel spaces will also result in some
flattening of the LTAS, which raises the interesting possibility that a more poorly
sampled articulatory space (as occurs in real speech) may provide a more effective LTAS
– at least in terms of its effects on target perception. This will be tested in future studies
pitting vowel space carrier phrases against sentences that are phonetically balanced or
imbalanced. The LTAS and neutral vowel acoustic comparisons will be paired with
perceptual studies. These studies have the potential of further elucidating why carrier
phrase effects occur for some contexts and not others.
7.4 Specific Aim #3: Determine what aspects of the comparisons of the LTAS and
the target predict shifts in target categorization.
The experiments in Chapter 6 aimed to determine the finer details of the spectral
contrast mechanism. Specifically, Experiment 7 investigated how close in frequency the
peak of the carrier needed to be to the target’s formant to shift perception. Experiment 8
investigated whether the amplitude of the peak mattered and Experiment 9 looked at
whether troughs (regions of low amplitude) played a role in perception. It was thought
that the best way to get at this would be to use very simple context stimuli whose peaks
and troughs could be easily manipulated. Harmonic complexes provided the necessary
control while maintaining some similarity to speech stimuli. The set of stimuli utilized in
these studies consisted of steady-state sounds constructed from 12 harmonics (multiples
of 100 Hz) with different amplitudes. The peaks and troughs in these patterns were based
on the target stimuli spectra and included the range of parameters predicted to most likely
result in contrast effects. No shifts in target perception were obtained from any of the
conditions across the three experiments.
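For reference, a steady-state harmonic complex of the kind used in these studies can be synthesized in a few lines. The particular amplitude pattern below is a placeholder with a single peak, not one of the experimental stimuli:

```python
import numpy as np

def harmonic_complex(levels_db, f0=100.0, dur=1.0, fs=16000):
    """Sum of len(levels_db) harmonics of f0, each with the given
    relative level in dB, normalized to +/-1 to avoid clipping."""
    t = np.arange(int(dur * fs)) / fs
    out = np.zeros_like(t)
    for k, level in enumerate(levels_db, start=1):
        out += 10 ** (level / 20) * np.sin(2 * np.pi * k * f0 * t)
    return out / np.abs(out).max()

# e.g., 12 harmonics of 100 Hz with a "peak" at the 5th harmonic (500 Hz)
levels = [-20.0] * 12
levels[4] = 0.0
stim = harmonic_complex(levels)
```

Because the peaks and troughs are set directly by `levels_db`, such stimuli give the precise spectral control that natural carrier phrases lack.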
7.4.1 Interpretation and Future Research
Before completely discounting the use of harmonic complexes as a preceding
carrier in these experiments, it should be noted that methodological changes could result
in more sensitive measures. In Experiments 7-9, the stimuli were randomized within
each block. Sjerps, Mitterer, and McQueen (2011a) claim that context effects are more
robust when blocking by preceding carrier, rather than mixing the presentation of the
stimuli within a block. They suggest that when the stimuli are heard together, the
preceding carriers may be interacting across trials and that this may weaken effects. If
this is the case, then the harmonic complexes could potentially elicit effects if they are
separated into different blocks and presented repeatedly, instead of randomized.
If these methodological changes do not result in significant context effects, then
the complete lack of shifts for these stimuli indicates a very strong constraint on the types
of context that will elicit effects on target perception. Previous work has demonstrated
large effects of non-speech sounds on speech perception, whether it be sequences of tones
(Holt, 2005; 2006) or single frequency-modulated tones (Lotto & Kluender, 1987).
These previous stimuli tended to include frequency or amplitude modulation. This may
be an important factor for eliciting such effects (as discussed below).
7.5 Additions to the Model Suggested By Current Results
The LTAS model presented here assumed that any stimulus could affect the
perception of a subsequent target sound if it contained a peak in spectral energy in a
region around an important phonetic cue (e.g., a specific formant). However, not all
contexts resulted in shifts in target perception in the experiments reported here,
challenging the assumptions of the simplest version of the model. These unpredicted
results provide an opportunity to develop new hypotheses about the mechanisms involved
in talker normalization effects.
First, let’s consider the reversed-phrase contexts. One of the most consistent
effects across the experiments (Preliminary Experiment 3, Experiment 1, and Experiment
2) was that the forward full phrase “He had a rabbit and a” resulted in more “bit”
responses for the ShortVT than for the LongVT. However, when the phrase was played
backward, these effects were either smaller or absent. One possible constraint on context
effects is that the target and context must be considered part of the same auditory object
or stream. Bregman (1990) referred to the perceptual process of grouping acoustic
characteristics into coherent objects or streams as Auditory Scene Analysis. According
to Bregman’s formulation, sounds only interact with each other when they are part of the
same stream. It is more difficult to determine relations (such as temporal order) when the
sounds belong to different streams. This provides a possible explanation for the
smaller/absent effect for backward contexts versus forward. When “He had a rabbit and
a” is presented forward, it is easily identified as speech and would belong to the same
auditory event as the following target. When it is played backward, it is not readily
identified as speech, but still has a strong acoustic structure that may cause it to be
identified as a different auditory event. Importantly, the internal structure of the
backward speech would result in a strong cue for formation of its own auditory object
separate from the target. In this way, the backward context here would be different from
the tone sequences of Holt (2005), which did not have a strong internal structure. As a
result, these tones may have not formed a strong separate object and could therefore
interact with the speech target resulting in a significant context effect. It should also be
noted that the formation of auditory streams takes time (Bregman, 1986), so that shorter
contexts may not be as strongly segregated perceptually, which may explain the
significant effects obtained for the 325-msec backward contexts in Experiments 2 and 3.
If future studies provide more evidence of a strong role for perceptual streaming in carrier
phrase effects then it will suggest that the mechanism for talker normalization must
follow the processes involved in Auditory Scene Analysis.
It should be noted that significant effects have been obtained for reversed contexts
in other studies. Watkins (1991) ran a series of experiments – two of which used
reversed speech stimuli. Both of these experiments found effects for the reversed speech.
His carrier phrases were male and female recordings of the phrase “The next word is”
(male: 1 sec; female: 1.07 sec). These carriers were around the same length as the
reversed carrier used here (1.3 sec). Sjerps, Mitterer, and McQueen (2011b) ran a study
comparing the perception of a target series following normal speech, spectrally-rotated
speech, and highly-manipulated speech. The carrier phrase in this case was a Dutch
phrase (English translation: “On that book, it doesn’t say the name”) that was
synthesized to sound like different talkers by manipulating the F1 value of the phrase.
The exact length was not given, but this phrase is clearly of comparable length to the
carriers of Watkins (1991) and the “rabbit” phrase. Like reversed speech, the spectrally-rotated speech
has strong spectro-temporal characteristics but would no longer be identified as speech.
The purpose of the “highly-manipulated” speech was to make it more unlike normal
speech. This was done by not only spectrally-rotating the speech but also by flattening
the pitch, removing certain amplitude ranges, reversing every syllable, and matching the
amplitude across the entire phrase. The spectrally-rotated speech was found to have an
effect on target perception, but the highly-manipulated speech did not. Given the number
of differences between the contexts and targets used in these studies and the current
experiments, it would be difficult to provide a full explanation for the differences in
results with any confidence. Future experiments will manipulate the carrier phrases used
to examine how some of these differences (for example, having a voiced segment
immediately preceding the context) may affect the outcome.
The other contexts that did not cause a perceptual shift in target identification
were the vowel repetitions of Experiment 3 (examining whether there was a weighted
temporal window for the LTAS) and the harmonic complexes of Experiments 7, 8, and 9.
Vowel repetitions should not be at risk of being segregated from a target
speech sound, as they should be identified as speech. Watkins (1991) also ran an
experiment that used vowels. Each vowel sequence consisted of the same vowel (200
msec) repeated four times (the vowel was either /ɪ/, /ɛ/, or the ambiguous vowel in
between the two). The vowels used in Experiment 3 were 250 msec in length and the
sequences were some combination of /i/ or /a/, extending from one to three vowels. In
this case, Watkins (1991) was unable to find an effect. He suggested that contrastive
effects require that the carrier has time-varying spectro-temporal properties, which static,
synthesized vowels do not. While this seems plausible, Experiment 6 did find an effect
when static vowels were used. The neutral vowels were 1100 msec in length, synthesized
from the Long and Short VT (ShortVT resulted in more “bet” responses). Static vowels,
then, are capable of causing shifts in perception.
The harmonic complexes used in Chapter 6 could have both the problem of being
segregated into a separate auditory object and of not having spectro-temporal changes. It
is interesting to note, though, that Holt (2005; 2006) and Laing, Liu, Lotto, and Holt
(2012) found effects when the carrier phrases they used were series of sine wave tones
with high or low mean frequencies. In Laing et al. (2012), these non-speech carriers
consisted of seventeen 70-msec tones with 30-msec ISIs (1700-msec total). These simple
tones were able to cause an effect, but the harmonic complexes were not. This is curious, as the
harmonic complexes are only slightly more “complex.” Previous work has suggested
that spectro-temporal variation is essential for obtaining context effects (Watkins, 1991;
Sjerps, Mitterer, and McQueen, 2011b). This cannot be completely true as context
effects can be obtained for the neutral vowels. However, it is possible that frequency or
amplitude modulation enhances the context effects. The auditory system is very sensitive
to modulations and spectro-temporal variability may result in more robust representations
of the context. Future studies will examine the influence of adding frequency and/or
amplitude modulations. If it is the case that spectro-temporal variability enhances talker
normalization effects, this will have major implications for the LTAS model. It will be
necessary to determine whether the modulation enhances LTAS computation, reduces
perceptual stream segregation, or enhances the mechanism of contrast.
7.6 Conclusion
The simple LTAS model of extrinsic talker normalization was based on a set of
findings in perception and speech production. The first model was based on the most
basic assumptions of how the LTAS is calculated, how it relates to the neutral vowel and
how contrast effects are predicted. The results of nine experiments demonstrate that
many additions will have to be made to the basic model to make it viable. In particular,
it is quite possible that LTAS computation occurs over some finite temporal window and
only on segments that are voiced. The spectral contrast mechanism that changes target
perception relative to the LTAS likely follows perceptual grouping (Auditory Scene
Analysis) and may be strongly affected by the spectro-temporal variability of the context.
The continued research based on the results of these experiments is likely to provide
insight both into general auditory processes for complex sounds and into adaptive
processes in speech perception.
APPENDIX A. AMBIGUOUS TARGET SELECTION
The ambiguous targets for the “bit” to “bet” to “bat” series were determined in a
brief pilot study. The study consisted of three participants. Each participant listened to
the 20-step target series (described in Ch. 3, Section 3.2.2) that changed perceptually
from “bit” to “bet” to “bat”. They then verbally responded as to which member of the
series was the ambiguous target and where the target series began and ended. They
were allowed to listen to the target series as many times as they needed to make a
decision. Since these three participants were all highly familiar with listening to and
creating experiments with the carrier phrase + target paradigm, their experience was
deemed sufficient to make these judgments. The stimuli were chosen by the majority.
For example, one participant thought bVt6 was the ambiguous target in the “bit” to “bet”
series, but the other two thought it was bVt5, so bVt5 was chosen as the ambiguous
target. From these results, the target “bit” to “bet” series consisted of the first 11 stimuli
from the series (bVt1-bVt11). The ambiguous target was judged to be the fifth member (bVt5).
The “bet” to “bat” series consisted of 8 steps (bVt10-bVt17). The ambiguous target was judged
to be the fourth member (bVt13).
APPENDIX B. FILTERED EXPERIMENTS
To make LTAS predictions with the “bit” to “bet” to “bat” series, it was necessary
to determine which formant(s) played a role in the perception of the vowel. That is,
which formants were susceptible to spectral contrast effects. In particular, it was tested
whether the first formant carried the effect or if the second formant (and higher) also
played a role. To test this, neutral vowels synthesized from the LongVT and ShortVT
were presented to listeners either with no filter, with a high-pass filter, or with a low-pass
filter. The cut-off for both filters was 1200 Hz. These served as the
carrier phrases and were followed by the 8-step “bet” to “bat” series. Eighteen
participants ran in this study. Listeners heard 10 repetitions of each stimulus for a total of
480 trials (2 vocal tracts x 1 phrase x 3 filters x 8 targets x 10 repetitions). Results
showed that the ShortVT resulted in significantly more “bet” responses (t(17) = -5.95, p<
.001), as did the low-pass condition (t(17) = -8.53, p<.001). The high-pass condition
showed no significant difference between the two phrases (t(17) = 1.01, p>.05). Since
the low-pass filtered stimuli (having the first formant) resulted in a significant difference,
it suggests that the first formant is the cue that is important for the “bet/bat” distinction
and is susceptible to spectral contrast effects. It does not seem that the higher formants
play a role, since no effect was found for the high-pass filtered stimuli. Since the
“bet/bat” targets were part of the extended series that included “bit” to “bet”, it was
assumed that these results could generalize to the other end of the series, as well. Also,
previous literature has only focused on manipulating the first formant when creating
series that use the vowels /ɪ/ and /ɛ/, while keeping the other formants constant (Darwin,
McKeown, & Kirby, 1989; Watkins, 1991; Watkins & Makin, 1994; Sjerps, Mitterer, &
McQueen, 2011a; 2011b). This provides further evidence that the first formant is the
important cue for the “bit/bet” distinction. Therefore, the LTAS comparisons used to
make predictions focus on the first formant region of the ambiguous target of the series.
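The filtering manipulation can be sketched as follows. The dissertation does not specify the filter design, so the zero-phase Butterworth below is an assumed choice, not the actual method:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def split_carrier(signal, fs, cutoff=1200.0, order=4):
    """Return low-pass and high-pass versions of a carrier, split at
    cutoff Hz with zero-phase Butterworth filters."""
    lo_sos = butter(order, cutoff, btype="low", fs=fs, output="sos")
    hi_sos = butter(order, cutoff, btype="high", fs=fs, output="sos")
    return sosfiltfilt(lo_sos, signal), sosfiltfilt(hi_sos, signal)
```

With a 1200-Hz cut-off, the low-pass output retains the F1 region while the high-pass output retains F2 and above, matching the logic of the comparison reported here.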
REFERENCES
Ainsworth, W. A. (1972). Duration as a cue in the recognition of synthetic vowels. The
Journal of the Acoustical Society of America, 51(2B), 648–651.
Ainsworth, W. A. (1974). The influence of precursive sequences on the perception of
synthesized vowels. Language and Speech, 17(2), 103–109.
Ainsworth, W. A. (1975). Intrinsic and extrinsic factors in vowel judgments. In G. Fant &
M. Tatham (Eds.), Auditory analysis and perception of speech, (pp.105-113).
London: Academic Press.
Assmann, P. F., Nearey, T. M., & Hogan, J. T. (1982). Vowel identification:
Orthographic, perceptual, and acoustic aspects. The Journal of the Acoustical
Society of America, 71(4), 975-989.
Balo, S., et al. (1992-2004). Adobe Audition 1.5. [computer software]. San Jose, CA:
Adobe Systems.
Blumstein, S. E., & Stevens, K. N. (1979). Acoustic invariance in speech production:
Evidence from measurements of the spectral characteristics of stop consonants.
The Journal of the Acoustical Society of America, 66(4), 1001–1017.
Blumstein, S. E., & Stevens, K. N. (1980). Perceptual invariance and onset spectra for
stop consonants in different vowel environments. The Journal of the Acoustical
Society of America, 67(2), 648–662.
Blumstein, S.E., & Stevens, K. N. (1981). Phonetic features and acoustic invariance in
speech. Cognition, 10(1–3), 25–32.
Bradlow, A. R., Kraus, N., & Hayes, E. (2003). Speaking clearly for children with
learning disabilities: Sentence perception in noise. Journal of Speech, Language
and Hearing Research, 46(1), 80-97.
Bregman, A.S. (1990). Auditory scene analysis: The perceptual organization of sounds.
Cambridge, MA: MIT Press.
Broadbent, D. E. & Ladefoged, P. (1960). Vowel judgements and adaptation level.
Proceedings of the Royal Society of London. Series B. Biological Sciences,
151(944), 384-399.
Byrne, D. (1977). The speech spectrum-some aspects of its significance for hearing aid
selection and evaluation. British Journal of Audiology, 11(2), 40-46.
Byrne, D., Dillon, H., Tran, K., Arlinger, S., Wilbraham, K., Cox, R., ... & Ludvigsen, C.
(1994). An international comparison of long‐term average speech spectra. The
Journal of the Acoustical Society of America, 96(4), 2108-2120.
Chen, Y. (2008). The acoustic realization of vowels of Shanghai Chinese. Journal of
Phonetics, 36(4), 629-648.
Chevreul, M.E. (1839). De la loi du contraste simultané des couleurs. Paris: Pitois-Levrault.
Chiba, T. and Kajiyama, M. (1958). The Vowel, Its Nature and Structure. Phonetic
Society of Japan.
Cleveland, T. F., Sundberg, J., & Stone, R. E. (2001). Long-term-average spectrum
characteristics of country singers during speaking and singing. Journal of Voice,
15(1), 54-60.
Clopper, C. G., Pisoni, D. B., & Jong, K. de. (2005). Acoustic characteristics of the
vowel systems of six regional varieties of American English. The Journal of the
Acoustical Society of America, 118(3), 1661–1676.
Cornelisse, L. E., Gagné, J. P., & Seewald, R. C. (1991). Ear level recordings of the
long-term average spectrum of speech. Ear and Hearing, 12(1), 47-54.
Darwin, C. J., McKeown, J. D., & Kirby, D. (1989). Perceptual compensation for
transmission channel and speaker effects on vowel quality. Speech
Communication, 8(3), 221–234.
Dechovitz, D. (1977a). Information conveyed by vowels: A confirmation. Haskins
Laboratory Status Report, 213–219.
Dechovitz, D. R. (1977b). Information conveyed by vowels: a negative finding. The
Journal of the Acoustical Society of America, 61(S1), S39–S39.
Delattre, P. C., Liberman, A. M., & Cooper, F. S. (1955). Acoustic loci and transitional
cues for consonants. The Journal of the Acoustical Society of America, 27(4),
769–773.
Di Lollo, V. (1964). Contrast effects in the judgment of lifted weights. Journal of
Experimental Psychology, 68(4), 383–387.
Dilley, L. C., & Pitt, M. A. (2007). A study of regressive place assimilation in
spontaneous speech and its implications for spoken word recognition. The Journal
of the Acoustical Society of America, 122, 2340-2353.
Dunn, H.K., & White, S.D. (1940). Statistical measurements on conversational speech.
The Journal of the Acoustical Society of America, 11(3), 278-288.
Ernestus, M. (2000). Voice assimilation and segment reduction in casual Dutch. A
corpus-based study of the phonology-phonetics interface. Utrecht: LOT.
Esposito, A. (2002). On vowel height and consonantal voicing effects: Data from Italian.
Phonetica, 591(4), 197-231.
Fant, G. (1960). Acoustic theory of speech production. The Netherlands: Mouton & Co.,
The Hague.
Fenn, K. M., Shintel, H., Atkins, A. S., Skipper, J. I., Bond, V. C., & Nusbaum, H. C.
(2011). When less is heard than meets the ear: Change deafness in a telephone
conversation. The Quarterly Journal of Experimental Psychology, 64(7), 1442–
1456.
Ferguson, S. H., & Kewley-Port, D. (2002). Vowel intelligibility in clear and
conversational speech for normal-hearing and hearing-impaired listeners. The
Journal of the Acoustical Society of America, 112, 259-271.
Fitch, W. T., & Giedd, J. (1999). Morphology and development of the human vocal tract:
A study using magnetic resonance imaging. The Journal of the Acoustical Society
of America, 106(3), 1511-1522.
Frazer, T. C. (1993). “Heartland” English: Variation and transition in the American
Midwest. Tuscaloosa, AL: University Alabama Press.
Glidden, C. M., & Assmann, P. F. (2004). Effects of visual gender and frequency shifts
on vowel category judgments. Acoustics Research Letters Online, 5(4), 132-138.
Goldinger, S. D. (1998). Echoes of echoes? An episodic theory of lexical access.
Psychological Review, 105(2), 251-279.
Gottfried, T.L. (1984). Effects of consonant context on the perception of French vowels.
Journal of Phonetics, 12(2), 91-114.
Greenberg, S. (1999). Speaking in shorthand–A syllable-centric perspective for
understanding pronunciation variation. Speech Communication, 29(2), 159-176.
Halberstam, B., & Raphael, L. J. (2004). Vowel normalization: the role of fundamental
frequency and upper formants. Journal of Phonetics, 32(3), 423–434.
Halle, M., & Stevens, K. (1962). Speech recognition: A model and a program for
research. Information Theory, IRE Transactions on, 8(2), 155–159.
Helson, H. (1964). Adaptation-level theory. New York: Harper & Row.
Hillenbrand, J. M., Clark, M. J., & Nearey, T. M. (2001). Effects of consonant
environment on vowel formant patterns. The Journal of the Acoustical Society of
America, 109(2), 748-763.
Hillenbrand, J.M., & Gayvert, R.T. (2005). Open source software for experiment design
and control. Journal of Speech, Language, and Hearing Research, 48, 45-60.
Hinton, L., Moonwomon, B., Bremner, S., Luthin, H., Van Clay, M., Lerner, J., &
Corcoran, H. (2011). It’s not just the valley girls: A study of California English.
In Proceedings of the Annual Meeting of the Berkeley Linguistics Society (Vol. 13).
Holt, L. L. (2005). Temporally nonadjacent nonlinguistic sounds affect speech
categorization. Psychological Science, 16(4), 305-312.
Holt, L. L. (2006). The mean matters: Effects of statistically defined nonspeech spectral
distributions on speech categorization. The Journal of the Acoustical Society of
America, 120(5), 2801-2817.
Holt, L. L., Lotto, A. J., & Kluender, K. R. (2000). Neighboring spectral content
influences vowel identification. The Journal of the Acoustical Society of America,
108(2), 710-722.
Jacewicz, E., Fox, R. A., & Salmons, J. (2007). Vowel space areas across dialects and
gender. In Proceedings of the XVIth International Congress of Phonetic Sciences
(pp. 1465–1468).
Jacobsen, T., Schröger, E., & Sussman, E. (2004). Pre-attentive categorization of vowel
formant structure in complex tones. Cognitive Brain Research, 20(3), 473-479.
Johnson, K. (2004). Massive reduction in conversational American English. In
Spontaneous speech: Data and analysis. Proceedings of the 1st session of the 10th
international symposium (pp. 29-54). Tokyo, Japan: The National Institute for
Japanese Language.
Joos, M. (1948). Acoustic phonetics. Language, 24(2), 5-136.
Kahane, J. C. (1982). Growth of the human prepubertal and pubertal larynx. Journal of
Speech and Hearing Research, 25(3), 446-455.
Katz, W. F., & Assmann, P. F. (2001). Identification of children’s and adults’ vowels:
Intrinsic fundamental frequency, fundamental frequency dynamics, and presence
of voicing. Journal of Phonetics, 29(1), 23–51.
Krapp, G. P. (1925). The English language in America. New York: Frederick Ungar.
Kurath, H. (Ed.) (1939). The linguistic atlas of New England. Providence: Brown
University Press.
Labov, W., Ash, S., & Boberg, C. (2005). The atlas of North American English:
Phonetics, phonology and sound change. Berlin, Germany: Mouton de Gruyter.
Ladefoged, P., & Broadbent, D. E. (1957). Information conveyed by vowels. The Journal
of the Acoustical Society of America, 29(1), 98–104.
Laing, E. J., Liu, R., Lotto, A. J., & Holt, L. L. (2012). Tuned with a tune: talker
normalization via general auditory processes. Frontiers in Psychology, 3, 1-9.
Liberman, A. M. (1996). Speech: A special code. Cambridge, MA: MIT Press.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967).
Perception of the speech code. Psychological Review, 74(6), 431–461.
Lindblom, B. (1963). Spectrographic study of vowel reduction. The Journal of the
Acoustical Society of America, 35(11), 1773-1781.
Lindblom, B. (1990). Explaining phonetic variation: A sketch of the H&H theory. In W.J.
Hardcastle & A. Marchal (Eds.), Speech production and speech modelling, (pp.
403–439). Dordrecht, Netherlands: Kluwer Academic Publishers.
Lindblom, B., & Sundberg, J. E. (1971). Acoustical consequences of lip, tongue, jaw,
and larynx movement. The Journal of the Acoustical Society of America, 50(4B),
1166-1179.
Lindblom, B., & Studdert-Kennedy, M. (1967). On the role of formant transitions in
vowel recognition. The Journal of the Acoustical Society of America, 42(4), 830–
843.
Linville, S. E. (2002). Source characteristics of aged voice assessed from long-term
average spectra. Journal of Voice, 16(4), 472-479.
Linville, S. E., & Rens, J. (2001). Vocal tract resonance analysis of aging voice using
long-term average spectra. Journal of Voice, 15(3), 323-330.
Lotto, A. J., & Holt, L. L. (2006). Putting phonetic context effects into context: A
commentary on Fowler (2006). Attention, Perception, & Psychophysics, 68(2),
178-183.
Lotto, A.J., Ide-Helvie, D.L., McCleary, E.A., & Higgins, M.B. (2006). Acoustics of
clear speech from children with normal hearing and cochlear implants. The
Journal of the Acoustical Society of America, 119(5), 3341.
Lotto, A. J., & Kluender, K. R. (1998). General contrast effects in speech perception:
Effect of preceding liquid on stop consonant identification. Attention, Perception,
& Psychophysics, 60(4), 602-619.
Lotto A. J., & Sullivan S. C. (2007). Speech as a sound source. In W.A. Yost, R.R. Fay,
& A.N. Popper (Eds.), Auditory perception of sound sources, (pp. 281-305). New
York: Springer Science and Business Media, LLC.
MacMahon, M.K.C. (2007). The work of Richard John Lloyd (1846–1906) and “the
crude system of doctrine which passes at present under the name of Phonetics.”
Historiographia Linguistica, 34(2/3), 281–331.
Magnuson, J. S., & Nusbaum, H. C. (2007). Acoustic differences, listener expectations,
and the perceptual accommodation of talker variability. Journal of Experimental
Psychology: Human Perception and Performance, 33(2), 391-409.
Malécot, A. (1956). Acoustic cues for nasal consonants: An experimental study involving
a tape-splicing technique. Language, 32(2), 274–284.
McGowan, R. S. (1997). Vocal tract normalization for articulatory recovery and
adaptation. In K. Johnson & J. W. Mullennix (Eds.), Talker Variability in Speech
Processing (pp. 211–26). San Diego: Academic Press.
McGowan, R. S., & Cushing, S. (1999). Vocal tract normalization for midsagittal
articulatory recovery with analysis-by-synthesis. The Journal of the Acoustical
Society of America, 106(2), 1090-1105.
Mendoza, E., Valencia, N., Muñoz, J., & Trujillo, H. (1996). Differences in voice quality
between men and women: Use of the long-term average spectrum (LTAS).
Journal of Voice, 10(1), 59-66.
Miller, J. D. (1989). Auditory-perceptual interpretation of the vowel. The Journal of the
Acoustical Society of America, 85(5), 2114-2134 .
Monahan, P. J., & Idsardi, W. J. (2010). Auditory sensitivity to formant ratios: Toward an
account of vowel normalisation. Language and Cognitive Processes, 25(6), 808–
839.
Moon, S. J., & Lindblom, B. (1994). Interaction between duration, context, and speaking
style in English stressed vowels. The Journal of the Acoustical Society of America,
96, 40-55.
Nakamura, M., Iwano, K., & Furui, S. (2007). The effect of spectral space reduction in
spontaneous speech on recognition performances. In Proceedings of the IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP
2007) (Vol. 4, pp. IV-473). IEEE.
Nearey, T. M. (1989). Static, dynamic, and relational properties in vowel perception. The
Journal of the Acoustical Society of America, 85(5), 2088-2113.
Neel, A. T. (2008). Vowel space characteristics and vowel identification accuracy.
Journal of Speech, Language, and Hearing Research, 51(3), 574-585.
Nordström, P.-E., & Lindblom, B. (1975). A normalization procedure for vowel formant
data. Paper 212 presented at the International Congress of Phonetic Sciences,
Leeds, August 1975.
Peterson, G. E. (1952). The information-bearing elements of speech. The Journal of
the Acoustical Society of America, 24(6), 629–637.
Peterson, G. E. (1961). Parameters of vowel quality. Journal of Speech and Hearing
Research, 4(1), 10-29.
Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of the vowels.
The Journal of the Acoustical Society of America, 24(2), 175–184.
Picheny, M. A., Durlach, N. I., & Braida, L. D. (1986). Speaking clearly for the hard of
hearing II: Acoustic characteristics of clear and conversational speech. Journal of
Speech, Language and Hearing Research, 29(4), 434-446.
Pittman, A. L., Stelmachowicz, P. G., Lewis, D. E., & Hoover, B. M. (2003). Spectral
characteristics of speech at the ear: Implications for amplification in children.
Journal of Speech, Language and Hearing Research, 46(3), 649-657.
Pluymaekers, M., Ernestus, M., & Baayen, R. H. (2005). Lexical frequency and acoustic
reduction in spoken Dutch. The Journal of the Acoustical Society of America,
118(4), 2561-2569.
Potter, R. K., & Steinberg, J. C. (1950). Toward the specification of speech. The Journal
of the Acoustical Society of America, 22(6), 807–820.
Recasens, D. (1985). Coarticulatory patterns and degrees of coarticulatory resistance in
Catalan CV sequences. Language and Speech, 28(2), 97-114.
Remez, R. E., Rubin, P. E., Nygaard, L. C., & Howell, W. A. (1987). Perceptual
normalization of vowels produced by sinusoidal voices. Journal of Experimental
Psychology: Human Perception and Performance, 13(1), 40-61.
Rhode, W. S. (1971). Observations of the vibration of the basilar membrane in squirrel
monkeys using the Mössbauer technique. The Journal of the Acoustical Society
of America, 49(4B), 1218–1231.
Robles, L., Rhode, W. S., & Geisler, C. D. (1976). Transient response of the basilar
membrane measured in squirrel monkeys using the Mössbauer effect. The Journal
of the Acoustical Society of America, 59(4), 926-939.
Sachs, M. B., & Young, E. D. (1979). Encoding of steady-state vowels in the auditory
nerve: Representation in terms of discharge rate. The Journal of the Acoustical
Society of America, 66(2), 470–479.
Sjerps, M. J., Mitterer, H., & McQueen, J. M. (2011a). Listening to different speakers:
On the time-course of perceptual compensation for vocal-tract characteristics.
Neuropsychologia, 49, 3831-3846.
Sjerps, M. J., Mitterer, H., & McQueen, J. M. (2011b). Constraints on the processes
responsible for the extrinsic normalization of vowels. Attention, Perception, &
Psychophysics, 73(4), 1195–1215.
Story, B. H. (1995). Physiologically-based speech simulation using an enhanced
wave-reflection model of the vocal tract (Doctoral dissertation). University of Iowa,
Iowa City, IA.
Story, B. H. (2005a). A parametric model of the vocal tract area function for vowel and
consonant simulation. The Journal of the Acoustical Society of America, 117(5),
3231-3254.
Story, B. H. (2005b). Synergistic modes of vocal tract articulation for American English
vowels. The Journal of the Acoustical Society of America, 118(6), 3834-3859.
Story, B. H. (2007a). Time-dependence of vocal tract modes during production of vowels
and vowel sequences. The Journal of the Acoustical Society of America, 121(6),
3770-3789.
Story, B. H. (2007b). A comparison of vocal tract perturbation patterns based on
statistical and acoustic considerations. The Journal of the Acoustical Society of
America, 122(4), EL107-EL114.
Story, B. H., & Bunton, K. (2010). Relation of vocal tract shape, formant transitions, and
stop consonant identification. Journal of Speech, Language and Hearing
Research, 53(6), 1514-1528.
Story, B. H., & Titze, I. R. (1998). Parameterization of vocal tract area functions by
empirical orthogonal modes. Journal of Phonetics, 26(3), 223–260.
Story, B. H., Titze, I. R., & Hoffman, E. A. (1996). Vocal tract area functions from
magnetic resonance imaging. The Journal of the Acoustical Society of
America, 100(1), 537-554.
Strange, W., Weber, A., Levy, E.S., Shafiro, V., Hisagi, M., & Nishi, K. (2007). Acoustic
variability within and across German, French, and American English vowels:
Phonetic context effects. The Journal of the Acoustical Society of America,
122(2), 1111-1129.
Sussman, H. M. (1986). A neuronal model of vowel normalization and representation.
Brain and language, 28(1), 12–23.
Syrdal, A. K., & Gopal, H. S. (1986). A perceptual model of vowel recognition based on
the auditory representation of American English vowels. The Journal of the
Acoustical Society of America, 79(4), 1086-1100.
Thomas, E. R. (2001). An acoustic analysis of vowel variation in New World English.
Durham, NC: Duke University Press.
Titze, I. R. (1984). Parameterization of the glottal area, glottal flow, and vocal fold
contact area. The Journal of the Acoustical Society of America, 75(2), 570-580.
Traunmüller, H. (1981). Perceptual dimension of openness in vowels. The Journal of the
Acoustical Society of America, 69(5), 1465–1475.
Van Bergem, D. R., Pols, L. C. W., & Koopmans-van Beinum, F. J. (1988). Perceptual
normalization of the vowels of a man and a child in various contexts. Speech
Communication, 7(1), 1–20.
Verbrugge, R. R., Strange, W., Shankweiler, D. P., & Edman, T. R. (1976). What
information enables a listener to map a talker’s vowel space? The Journal of the
Acoustical Society of America, 60(1), 198-212.
Vorperian, H. K., Kent, R. D., Lindstrom, M. J., Kalina, C. M., Gentry, L. R., & Yandell,
B. S. (2005). Development of vocal tract length during early childhood: A
magnetic resonance imaging study. The Journal of the Acoustical Society of
America, 117(1), 338-350.
Vorperian, H. K., Wang, S., Chung, M. K., Schimek, E. M., Durtschi, R. B., Kent, R. D.,
Ziegert, A.J., & Gentry, L. R. (2009). Anatomic development of the oral and
pharyngeal portions of the vocal tract: An imaging study. The Journal of the
Acoustical Society of America, 125(3), 1666-1678.
Warner, N. L. (2005). Reduction of flaps: Speech style, phonological environment, and
variability. The Journal of the Acoustical Society of America, 118(3), 2035.
Warner, N., Fountain, A., & Tucker, B. V. (2009). Cues to perception of reduced flaps.
The Journal of the Acoustical Society of America, 125, 3317-3327.
Watkins, A. J. (1991). Central, auditory mechanisms of perceptual compensation for
spectral-envelope distortion. The Journal of the Acoustical Society of America,
90(6), 2942-2955.
Watkins, A. J., & Makin, S. J. (1994). Perceptual compensation for speaker
differences and for spectral-envelope distortion. The Journal of the Acoustical
Society of America, 96(3), 1263–1282.
Watkins, A. J., & Makin, S. J. (1996a). Some effects of filtered contexts on the
perception of vowels and fricatives. The Journal of the Acoustical Society of
America, 99(1), 588-594.
Watkins, A. J., & Makin, S. J. (1996b). Effects of spectral contrast on perceptual
compensation for spectral-envelope distortion. The Journal of the Acoustical
Society of America, 99(6), 3749–3757.
Williams, D. (1986). Role of dynamic information in the perception of coarticulated
vowels (Doctoral dissertation). University of Connecticut, Storrs, CT.