INDIVIDUAL DIFFERENCES IN DEGRADED SPEECH PERCEPTION
By
Kathy M. Carbonell
__________________________
A Dissertation Submitted to the Faculty of the
DEPARTMENT OF SPEECH, LANGUAGE & HEARING SCIENCES
In Partial Fulfillment of the Requirements
For the Degree of
DOCTOR OF PHILOSOPHY
In the Graduate College
THE UNIVERSITY OF ARIZONA
2016
THE UNIVERSITY OF ARIZONA
GRADUATE COLLEGE
As members of the Dissertation Committee, we certify that we have read the dissertation
prepared by Kathy M. Carbonell, titled Individual Differences in Degraded Speech
Perception and recommend that it be accepted as fulfilling the dissertation requirement
for the Degree of Doctor of Philosophy.
_______________________________________________________________________
Date: 12/11/15
Huanping Dai
_______________________________________________________________________
Date: 12/11/15
Andrew Wedel
_______________________________________________________________________
Date: 12/11/15
Stephen Wilson
_______________________________________________________________________
Date: 12/11/15
Natasha Warner
_______________________________________________________________________
Date: 12/11/15
Suzanne Curtin
Final approval and acceptance of this dissertation is contingent upon the candidate’s
submission of the final copies of the dissertation to the Graduate College.
I hereby certify that I have read this dissertation prepared under my direction and
recommend that it be accepted as fulfilling the dissertation requirement.
________________________________________________ Date: 12/11/15
Dissertation Director: Andrew Wedel
________________________________________________ Date: 12/11/15
Dissertation Director: Huanping Dai
STATEMENT BY AUTHOR
This dissertation has been submitted in partial fulfillment of requirements for an
advanced degree at the University of Arizona and is deposited in the University Library
to be made available to borrowers under rules of the Library.
Brief quotations from this dissertation are allowable without special permission, provided
that accurate acknowledgment of source is made. Requests for permission for extended
quotation from or reproduction of this manuscript in whole or in part may be granted by
the head of the major department or the Dean of the Graduate College when in his or her
judgment the proposed use of the material is in the interests of scholarship. In all other
instances, however, permission must be obtained from the author.
SIGNED: Kathy Marie Carbonell
ACKNOWLEDGEMENTS
I owe a great deal of gratitude to several individuals who contributed to the
success of this work and my years as a doctoral student. To my committee members,
Andy Wedel, Huanping Dai, Natasha Warner, Stephen Wilson and Suzanne Curtin –
thank you for your continuous support and feedback. I now have a special appreciation
for the influence that a committee with such diverse backgrounds can bring to a single
body of work. It has been intellectually stimulating and deeply rewarding.
Special thanks are also due to Dan Brenner, who graciously took time away from
writing his own dissertation to help me with data analysis. I would also like to thank
Jessamyn Schertz and Ran Liu for allowing me access to stimuli and programs for several
of my experimental tasks.
I also wish to express my sincere appreciation for all of those who supported me
outside of my academic career - my parents, my grandparents, my siblings, especially my
sister, Lisa, whose unfailing belief in my abilities is a constant source of motivation.
Finally, and most of all, I would like to thank Andrew Lotto; thank you for instilling in
me a deep love of ideas and for inspiring this journey in the first place.
“One never reaches home,” she said. “But where paths that have an affinity for each other
intersect, the whole world looks like home, for a time.”
-Hermann Hesse, Demian
DEDICATION
To the outliers
Thanks for keeping it interesting.
TABLE OF CONTENTS
LIST OF FIGURES……………………………………………………………………….8
LIST OF TABLES……………………………………………………………………….12
ABSTRACT………………………………………………………………..……….……13
CHAPTER 1 Introduction……………..…………………………………………….…...14
1.1 The problem of variability in degraded speech perception……………….14
1.2 Attempts to account for the variability in degraded speech perception
performance……………………………………………………………………31
1.3 Proposal…………………………………………………………………...36
CHAPTER 2 Reliability of individual differences in degraded speech perception….37
2.1 Introduction………………………………………………………………..37
2.2 Experiment 1 Degraded Speech Perception……………………………….38
2.2.1 Methods…………………………………………………………38
2.2.2 Preliminary study: determining equal difficulty across tasks…...42
2.3 Current Study …………………………………………………………….43
2.3.1 Results……………………………………………………….….48
2.3.2 Interim summary and discussion……………………………………57
CHAPTER 3 Reliability of individual differences in cue weighting adaptation………59
3.1 Introduction …………………………………………………………...….59
3.1.1 Cue Weighting…………………………………………………..60
3.1.2 Adaptability……………………………………………………..65
3.2 Experiment 2 Consonant & Vowel Adaptation Paradigm………………..69
3.2.1 Methods for Vowel Adaptation Paradigm…………………………69
3.2.2 Consonant Adaption Paradigm ………………………………...….72
3.2.2.1 Methods for Consonant Adaptation Paradigm…………...73
3.3 Results ……………………………………………………………………….77
3.4 Interim summary and discussion…………………………………………….85
CHAPTER 4 Predicting individual differences in degraded speech perception………...87
4.1 Introduction ……………………………………………………………….…87
4.1.2 Working memory capacity…………………………………………90
4.2 Experiment 3 Predicting degraded speech performance…...……...…………93
4.2.1 Methods…………………………………………………………….93
4.2.2 Results……………………………………………………………...94
4.3 Interim summary and discussion …………………………………………....99
CHAPTER 5 Discussion ………………………………………………………….……101
5.1 Summary of main findings …………………………………………………101
5.2 Conclusions and future directions..…………………………………………107
APPENDIX A…………………………………………………………………….…….113
APPENDIX B……………………………………………………………….………….114
REFERENCES...……………………………………………………………………….116
LIST OF FIGURES
Figure 1 Waveform recordings and corresponding spectrograms for the target word
“goat” for a) normal speech and b) speech in babble noise (-3 SNR)…………………...21
Figure 2 Waveform recordings and corresponding spectrograms for the target word
“goat” for a) normal speech and b) 8-channel noise-vocoded speech …………………..23
Figure 3 Individual subjects' performance on key word recognition for 6-band noise
vocoded speech…………………………………………………………………………..25
Figure 4 Waveform recordings and corresponding spectrograms for the sentence “appear
to wait then turn” for a normal speaker and a speaker with hypokinetic dysarthria……..26
Figure 5 Waveform recordings and corresponding spectrograms for a sentence with
normal speech and high-pass filtered speech……………………...……………………..28
Figure 6 Transcription scores represented by words correct for individual subjects........29
Figure 7 Waveform recordings and corresponding spectrograms for the target word
“goat” for normal speech and time-compressed speech…………………………………30
Figure 8 Steps involved in converting clear speech (left spectrogram) into noise-vocoded
speech (right spectrogram)……………………………………………………………….40
Figure 9 Example of the experiment screen view for the Degraded Speech word
recognition task…………………………………………………………………………..45
Figure 10 Example of the visual sequence of the Operation Span Task………….……46
Figure 11 Individual data for word recognition scores on the 8-channel noise-vocoded
speech task………………………………………………………………………………51
Figure 12 Individual data for word recognition scores on the time-compressed (by 30%)
speech task………………………………………………………………………………52
Figure 13 Individual data for word recognition scores on the speech in babble noise (-3
SNR) task………………………………………………………………………………53
Figure 14 Scatterplot of Speech in Babble noise scores on Session 1 and Session 2….54
Figure 15 Scatterplot of individual scores on Time-Compressed Speech for Session 1 and
Session 2…………………………………………………………………………………55
Figure 16 Scatterplot of individual scores on Noise-Vocoded Speech for Session 1 and
Session 2…………………………………………………………………………………56
Figure 17 Re-plot of Center Frequency (CF) weighting for individual subjects in tone
categorization from Holt & Lotto (2006)………………………………………………..64
Figure 18 Schematic of Vowel adaptation task by block. Purple and orange diamonds
represent the test stimuli in each block…………………………………………………..72
Figure 19 Schematic of Consonant adaptation task by block. Purple and orange
diamonds represent test stimuli in each block…………………………………………...74
Figure 20 Average percent "set" responses for test stimuli in Canonical and Reverse
blocks in Session 1……………………………………………………………………….78
Figure 21 Average percent "set" responses for test stimuli in Canonical and Reverse
blocks in Session 2……………………………………………………………………….78
Figure 22 Average percent "pa" responses for test stimuli in Canonical and Reverse
blocks in Session 1……………………………………………………………………….79
Figure 23 Average percent "pa" responses for test stimuli in Canonical and Reverse
Blocks in Session 2………………………………………………………………...…….79
Figure 24 Scatterplot of individual scores on the Vowel Adaptation task for Session 1
plotted against Session 2…………………………………………………………………81
Figure 25 Scatterplot of individual scores on the Consonant Adaptation task for Session
1 plotted against Session 2……………………………………………………………….82
Figure 26 Scatterplot of individual scores averaged across both sessions for the
Consonant Adaptation tasks plotted against the Vowel Adaptation task………………..83
Figure 27 Individual data for Vowel Adaptation score and percent correct word
recognition on the Speech in Babble Noise task…………………………………………96
Figure 28 Individual data for Vowel Adaptation score and percent correct word
recognition on the Compressed Speech task……………………………………………..97
Figure 29 Individual data for Vowel Adaptation score and percent correct word
recognition on the Noise-Vocoded Speech task…………………………………………97
Figure 30 Individual data for Consonant Adaptation score and percent correct word
recognition on the Compressed Speech task…………………………………………….98
Figure 31 Individual data for Consonant Adaptation score and percent correct word
recognition on the Noise-Vocoded Speech task…………………………………………98
LIST OF TABLES
Table 1 Overall means and standard deviations for each degraded speech task for
Session 1 and Session 2 (indicated by day)……………………………………………...48
Table 2 Correlational analyses across degraded speech tasks for Session 1…………….49
Table 3 Correlational analyses across degraded speech tasks for Session 2……………49
Table 4 Overall means and standard deviations for word recognition scores on each
degraded speech task averaged across both sessions…………………………………….50
Table 5 Overall means and standard deviations for each adaptation task for sessions 1
and 2……………………………………………………………………………………...81
Table 6 Correlational analyses within and across adaptation tasks for Session 1 and
Session 2…………………………………………………………………………………84
Table 7 Overall correlations between average performance on each adaptation task,
degraded speech task and working memory task………………………………………...95
ABSTRACT
One of the lasting concerns in audiology is the unexplained individual differences
in speech perception performance even for individuals with similar audiograms. One
proposal is that there are cognitive/perceptual individual differences underlying this
vulnerability and that these differences are present in normal hearing (NH) individuals
but do not reveal themselves in studies that use clear speech produced in quiet (because
of a ceiling effect). However, previous studies have failed to uncover
cognitive/perceptual variables that explain much of the variance in NH performance on
more challenging degraded speech tasks. This lack of strong correlations may be due to
either examining the wrong measures (e.g., working memory capacity) or to there being
no reliable differences in degraded speech performance in NH listeners (i.e., variability in
performance is due to measurement noise). The proposed project has three aims. The first is
to establish whether there are reliable individual differences in degraded speech
performance for NH listeners that are sustained both across degradation types (speech in
noise, compressed speech, noise-vocoded speech) and across multiple testing sessions.
The second aim is to establish whether there are reliable differences in NH listeners'
ability to adapt their phonetic categories based on short-term statistics, both across tasks
and across sessions. The third aim is to determine whether performance on degraded speech
perception tasks is correlated with performance on phonetic adaptability tasks, thus
establishing a possible explanatory variable for individual differences in speech
perception for NH and hearing-impaired listeners.
CHAPTER 1
Introduction
1.1 The problem of variability in degraded speech perception
Trying to understand speech in the presence of background noise is notoriously
problematic for hearing-impaired (HI) listeners. The reason for this difficulty lies in the
loss of cochlear hair cells, which not only provide amplification for low-amplitude sounds but
also capture the spectral fine structure of the speech signal - essentially resulting in a loss
of speech clarity regardless of the compensatory amplification provided by a hearing aid.
There is a general presumption that the ability of HI listeners to perceive speech
in background noise (or in other challenging listening environments) is dependent upon a
measure of hearing sensitivity or threshold. However, attempts to correlate speech
perception scores with hearing sensitivity have been widely unsuccessful. Yoshioka and
Thornton (1980) attempted to predict individual speech discrimination scores from
hearing thresholds by examining audiograms (graphical representations of hearing
sensitivity as measured by detection of pure tone frequencies ranging from 125 –
8,000Hz) of 361 patients as well as their corresponding Speech Discrimination Scores
(SDS) in quiet. The SDS is a measure of one’s ability to both detect and identify
monosyllabic words presented at an easily detectable intensity level. The authors were
able to test the predictive “prowess” of the audiogram by using a variety of statistical
methods (stepwise multiple regression, smear-and-sweep analysis, and a clinical
classification of the audiometric configuration). They observed an increase in variability
of speech discrimination scores with increased hearing loss, and therefore a relatively
poor relationship between audiometric thresholds and speech-intelligibility scores,
suggesting that detectability of sound is not necessarily indicative of intelligibility.
There also appears to be a lack of consistency between measures of speech
perception in quiet compared to speech perception in noise. Several investigators have
examined the relationship between HI listeners’ speech intelligibility in quiet vs. noise.
Nabelek and Pickett (1974) examined five listeners with similar hearing sensitivity
(moderate degrees of high-frequency hearing loss, HL). When examining the individual
subject data (while not the primary purpose of the experiment), the authors noted that the
listeners’ difference in speech recognition scores in quiet and in noise varied from
approximately 23 – 67% at a signal-to-noise ratio (SNR) of -5 dB, suggesting that similar
levels of hearing sensitivity do not necessarily reflect similarities in speech perception
abilities, especially in noise.
Plomp (1983) examined speech perception abilities of a group of 147 hearing-impaired
listeners with varying degrees and configurations of sensorineural hearing loss
(SNHL) in both quiet and in noise. The data were taken across several studies comparing
HI listeners' speech reception thresholds (SRTs) in quiet and in the presence of speech-shaped
noise (Duquesnoy & Plomp, 1983; Festen & Plomp, 1983; Plomp, 1983). The SRT
is determined by the lowest level (amplitude) at which a listener can accurately repeat
50% of a list of key words - familiar “spondees” (two-syllable words with equal stress in
both syllables, e.g., “baseball”) embedded in a sentence. Correlational analyses between
listeners' SRTs in quiet and in noise revealed a very poor relationship (r = 0.01), indicating
that speech recognition in quiet is not predictive of one’s ability to perceive speech in
noise.
Crandell (1991) also uncovered this disparity between HI listeners' speech
performance in quiet vs. noise in a study of hearing aid selection, even when
controlling for hearing sensitivity. For this investigation, 32 subjects with similar degrees
of hearing loss were selected (mild to moderate, high frequency SNHL). Subjects’ speech
reception thresholds (SRTs) were measured with speech stimuli consisting of high
predictability sentences from the speech intelligibility in noise (SPIN) test (Bilger, 1984;
Kalikow, Stevens & Elliot, 1977) in quiet and in multitalker babble noise. The competing
background noise for the noise condition was multitalker babble presented at various
amplitudes (75 and 85 dB). Comparisons between speech recognition in quiet and in
noise revealed nonsignificant relationships (r = 0.34 at 75 dB noise level, and 0.09 at 85
dB noise level). Follow-up test/re-test measures with the same group of subjects were
also reported, indicating that the variability in performance was reliable.
A similar frustrating variability is evident across listeners with cochlear implants
(surgically implanted hearing device for the deaf or profoundly hard of hearing) in that
speech perception performance is notoriously difficult to predict based on etiology of the
hearing impairment, masking functions, detection thresholds, etc. Even when taking into
account differences in the implant hardware, speech perception (or outcome)
performance is not predictable. Spahr, Dorman & Loiselle (2007) demonstrated this
problem in a study that examined speech perception performance for CI users post
implantation. The investigators divided a group of 39 subjects with CIs into 3 different
groups based on the type of speech processor they had on their device (group 1: CII
Bionic Ear BTE; group 2: Esprit3G BTE; group 3: Tempo + BTE). Within each group,
subjects were also matched on age, duration of deafness, length of experience with their
device, and monosyllabic word recognition scores in quiet. The groups were assessed on
a number of tests of speech understanding, voice discrimination and melody recognition
for the purpose of comparing the effects of the different types of CI processors on
performance. Results of the study revealed some improvement for group 1 (CII
processors) across all tasks when the input dynamic range was increased, but it is worth
noting the amount of variability in performance across participants despite such
controlled parameters. Prior to matching subjects on age, duration of deafness, etc., an
initial group of 76 patients were recruited after meeting the criteria of scoring greater than
40% correct on a monosyllabic word test. Of this first group of participants, separated by
CI processor type, the range of variability on a word recognition test in quiet spanned
25–100% correct (CII processors), 2–90% correct (3G processors), and 15–95% correct
(Tempo+ processors). Even when parsed down to the 39 subjects matched on all the
criteria mentioned previously, the range of variability on the same test of word
recognition spanned 50-90% correct (across all 3 CI processor type groups). These data
are telling of the inherent variability in speech perception ability that exists across HI
listeners even when matched on criteria that would otherwise be intuitive contributors to
this variability (e.g. age, duration of deafness, etc.).
To date, there exist few effective predictors of CI performance, although there
have been attempts to correlate performance post-implant with a number of
electrophysiological measures. For example, Hughes & Still (2008) found strong
correlations between psychophysical and physiological measures of forward masking and
the Evoked Compound Action Potential (ECAP), but neither correlated with speech
perception performance. Similarly, in a study by Green and colleagues, the only
independent predictor of speech perception outcome for CI users was duration of
deafness prior to implantation, which only accounted for 9% of the variability (Green et
al., 2007). Taken together, these findings suggest that perception of degraded speech for
HI listeners raises additional challenges that are independent of the processing of speech
sounds in quiet and listeners differ in their ability to accommodate these challenges.
Despite the prevalence of variability in speech perception performance across HI
listeners, there exists little information that attempts to account for the individual
differences in susceptibility to noise or challenging listening environments. Not only
would a better understanding of causes behind individual differences benefit the goal of
predicting outcome measures for hearing aids or CIs, but it also may have some
implications as to why listeners receive varying degrees of benefit from specific
rehabilitative strategies. Certain hearing aid signal processing strategies, for example,
have been shown to improve speech recognition abilities in some HI listeners, but not
others, despite similar audiometric configurations. Newman and colleagues experimented
with 3 different hearing aid signal processing strategies by presenting simulations of
speech in babble noise filtered through the different processors to 8 HI listeners with
similar degrees of hearing loss (Newman, Levitt, Mills & Schwander, 1987). Each
listener was tasked with judging the intelligibility of each processing strategy after being
presented with the simulations several times each. Ultimately, there was no single hearing
aid processing strategy that proved to be most intelligible for every listener. Van Tasell,
Larsen & Fabry (1988) attempted to examine the effects of a specific HA processor type
(adaptive filtering) on a group of HI listeners' speech perception abilities in noise. They
presented 6 HI listeners with monosyllabic and bisyllabic (spondee) words mixed with
speech-shaped noise and compared individual SRTs with and without the adaptive filter
processing in their hearing aids. The investigators were unable to elicit consistent
improvements in speech performance for the HI listeners using adaptive filtering, due
to the amount of variability in performance across the listeners.
Variability in Normal Hearing Populations
It is tempting to attribute all of this variability in outcomes for hearing impaired
listeners to the underlying hearing deficits, but several studies have demonstrated that
differences in speech intelligibility go beyond audibility. In a study of speech perception
and aging, Füllgrabe, Moore & Stone (2014) matched a group of older, normal hearing
participants (age range = 60-70 years) with a group of young, normal hearing participants
(age range = 18-27 years) and compared auditory and cognitive effects on speech
intelligibility. When comparing intelligibility of speech in the presence of various
backgrounds (i.e. quiet, two-talker babble noise), scores from the older group were
consistently lower than those of the younger group, despite matched audiograms. Scores on
tests of cognitive abilities, however, were also lower for the older group and correlated
positively with speech in noise scores. The results suggest that the decline in speech
perception in older persons is not the result of age-related hearing loss or audiometric
sensitivity, but is instead more likely caused by cognitive or perceptual changes that occur
with age.
Watson (1991; Surprenant & Watson, 2001) proposed that individual differences in
speech perception performance among HI individuals may be due to cognitive/perceptual
differences that are present in the normal hearing (NH) population. For careful speech
presented in quiet, these differences would not be apparent because NH performance is
near ceiling. However, if individual differences in speech perception abilities exist, they
would become apparent as variability in performance for challenging listening
conditions. Several studies have documented such variability across several types of
signal degradation.
Speech in Noise.
The most common or widely experienced form of degraded speech is speech
presented in the presence of background noise, for both HI and normal hearing listeners
(see Figure 1). The task for the listener is to segregate a single talker amongst extraneous
competing sounds or talkers as in a restaurant, bar, theatre, etc. This challenging listening
environment provides a spectrally degraded speech signal which, when presented at low
signal-to-noise ratios, will result in poor speech perception performance even by listeners
with normal pure-tone audiograms (Middelweerd et al., 1990; Rodriguez et al., 1990),
even those with almost perfect discrimination in quiet (Rupp & Phillips, 1969). Not only
does overall speech perception performance suffer in background noise, but the amount
of decline will vary across listeners, independent of hearing sensitivity in quiet.
Elsabbagh and colleagues (2011) tested normal hearing adults and children both with and
without Williams Syndrome (WS) in a word recognition task both in quiet and in noise
for the purpose of assessing the relationship between speech perception in noise for
Figure 1. Waveform recordings and corresponding spectrograms for the target word “goat” for a) normal speech and b) speech in babble noise (-3 SNR).
individuals with WS and their rating of the severity of hyperacusis (increased sensitivity
to certain frequencies and loudness levels). Participants heard six pairs of words – each
pair differing in a single consonant - and were asked to judge whether the word pair was
the ‘same’ or ‘different’. The word pairs were presented in quiet, in low amplitude
speech-shaped noise (SNR = -2.3 dB), or loud speech-shaped noise (SNR = -7.3 dB). The
group of NH, typically developing adult listeners was meant to serve as a comparison
group, but it is worth noting that even as a control (for age and hearing sensitivity), they
displayed substantial variability in performance in noise (71% - 96% correct in low noise;
54% - 83% in loud noise) compared to quiet (92% - 100% correct) (Elsabbagh et al.,
2011).
Even when the cause of degradation in the speech signal is simply distance from the
source, normal hearing listeners will show tremendous variability in speech perception
ability (Meyer, Dentel & Meunier, 2013). In this study, 36 NH participants, aged 18–30
years, took part in a distance simulation test examining how natural background noise,
increasing with distance from the source, influenced speech intelligibility. Subjects were
seated in front of a computer and were asked to identify single words being presented
over a loudspeaker, mixed with natural background noise corresponding to 12 distances
from the speaker (ranging from 11 to 33 meters). As one would suspect, results revealed
declines in intelligibility as distance from the source increased, but the amount of
variability in word recognition ability at each distance point was striking. As an
example, at 23 meters from the source, word recognition scores ranged from 20% to 79%
correct, much like the variability seen in HI populations when presented with speech in
background noise.
Noise-Vocoded Speech.
NH listeners also display large variations in the ability to perceive noise-vocoded
speech (Shannon et al., 1995) – a type of degradation used to simulate speech through a
cochlear implant. It is created by passing a speech waveform through a set of passband
filters covering a restricted frequency range. The amplitude envelope of the output from
each filter is extracted either by low-pass filtering or the Hilbert transform. White noise is
then filtered through the original filter bank and each filter's output is modulated by the
appropriate amplitude envelope from the speech signal. These amplitude modulated
noises are then added together to create a signal that retains much of the temporal
characteristics of the speech signal without the fine spectral detail. This type of
degradation presents a particularly challenging task for the listener – understanding
speech without the presence of fine spectral cues. While temporal fluctuations in the
amplitude envelope are preserved, pitch and formant saliency are weakened (see Figure
2).
Figure 2. Waveform recordings and corresponding spectrograms for the target word “goat” for a) normal speech and b) 8-channel noise-vocoded speech.
A recent study by Carbonell & Lotto (2013) examined carrier phrase effects on
intelligibility of noise vocoded speech. The purpose of the study was to test whether a
carrier phrase that did not provide semantic predictability would have any effect on the
intelligibility of a target word – a notion based on previous work suggesting that the mere
presence or absence of sound preceding a speech token could alter one’s perception of
that token (Carbonell & Lotto, 2012). For this study, normal hearing listeners were
presented a series of 250 monosyllabic English words in one of two conditions: 1) in
isolation or 2) preceded by the carrier phrase “the next word on the list is”. A six-channel (i.e. six bandpass filters) noise vocoding was applied to all stimuli in both
conditions. The task for the listener was to type the last word they heard in the sentence
(carrier phrase condition) or type the isolated word that they heard (isolation condition).
Responses were scored as percent correct word recognition and percent correct
identification of the initial consonant of the target word. Interestingly, there was a
significant improvement in both words and consonants identified when the target word
was preceded by the carrier phrase – an intelligibility boost of 36% for consonants and
20% for words. On average, all subjects benefited from hearing the carrier phrase before
the target words, but upon examination of the individual differences, there is quite a
spread in performance ability across all tasks. Figure 3 provides a graphic of the
variability in performance observed for the word recognition task, again for college-age,
normal hearing listeners. As one can readily see, there is quite a range of performance –
from 7% to 30% correct. As one might expect, these results fall in line with the
variability in performance for individuals with CIs, as the degraded signals are spectrally
comparable. This parallel between NH and HI performance variability hints at the
possibility that the variability in performance seen in the HI population may in fact be a
reflection of the variability inherent in the general population that becomes apparent in
the challenging listening conditions of hearing loss.
Figure 3. Individual subjects' performance on key word recognition for 6-band
noise vocoded speech. Horizontal red bar represents the mean (15% correct).
The range of scores is 7-30% correct (Carbonell & Lotto, 2013).
Dysarthric Speech.
One type of degradation that violates natural speech rhythm and segmentation is
dysarthric speech, which is produced by talkers with a neuro-muscular disorder that
disrupts production. There are several different types of dysarthria that result in various
errors in verbal speech production, specifically, prosodic, voicing and articulatory errors.
The most common form of dysarthria is hypokinetic - a symptom of Parkinson’s disease
that results in rapid, monotone speech, with imprecise articulations (see Figure 4). As a
naturally occurring type of degraded speech (as opposed to laboratory manipulations of
speech), dysarthric speech is a particularly important example of degradation of the
signal. Mattys & Liss (2008) emphasize the importance of examining this kind of
naturally occurring speech degradation for the purpose of successfully modeling speech
processing that accommodates the full range of variation that a listener will encounter.
Figure 4. Waveform recordings and corresponding spectrograms for the sentence “appear to wait then turn” for a) a normal speaker and b) a speaker with hypokinetic dysarthria (symptomatic of Parkinson’s disease).
Choe, Liss, Lotto, Azuma & Mathy (2009) examined the perceptual outcome of
normal hearing listeners presented with dysarthric speech. Forty phrases were recorded
by 12 patients with mixed dysarthria from amyotrophic lateral sclerosis (ALS). Acoustic
characteristics of mixed dysarthria in ALS include abnormally high or low F0, abnormal
shimmer (i.e. variation in intensity) and jitter (i.e. variation in frequency), longer vowel
duration, indistinctive voice onset time, and decline in F2 slope (Duffy, 1995). One
hundred normal hearing, college-age students participated in this study and were asked to
transcribe the phrases produced by the speakers with dysarthria (scored as percent correct
words and phonemes accurately transcribed), along with several other tasks: 1) sentence
verification and minimal pair discrimination in multitalker babble noise (produced by
normal, adult speakers); 2) spectral and temporal context effects; 3) working memory
span. Results from this study revealed no correlations between normal hearing listeners’
working memory span and speech perception performance on the degraded transcription
task. There were also no correlations between the context effects tasks and transcription
scores. However, it is interesting to note the remarkable range of normal hearing listener
performance on the transcription task – from 22% to 70% correct phonemes transcribed
and 25% to 58% correct words transcribed. While not exactly telling of the origin of
these large individual differences, these results are symptomatic of the large variability
observed in speech performance in adverse conditions, even for listeners with similar
hearing sensitivity.
High Frequency Energy.
Another example of performance variability under degraded listening conditions
comes from a study of speech acoustics above 5 kHz. Vitela, Monson & Lotto (2015)
sought to determine what, if any, information in the acoustic signal is present at
frequencies above what is considered important for speech comprehension, since most
studies of speech perception focus on acoustic information present below 6 kHz (Monson
et al., 2014). Stimuli for this study consisted of consonant-vowel (CV) pairings of the
following consonants: /p, b, t, d, k, g, f, v, m, n, s, z, ʃ/ and vowels: /i, æ, ɑ, o, u/,
produced by both a male and female speaker. All stimuli were bandpass filtered with
cutoff frequencies of 5.7 and 20 kHz (Figure 5 displays a spectrogram of the stimuli). A
low-frequency masker was added to ensure the listeners were making use solely of the
high frequency information and not any distortion products present in the lower
frequencies. NH listeners were tasked with categorizing the consonant and vowels for
each CV presentation as two separate tasks. For the vowel categorization task, the mean
accuracy was 15.8% correct (chance = 9.1%), and for the consonant categorization task,
mean accuracy was 51.5% correct (chance = 4.3%). While all participants performed
above chance, indicating that there is indeed relevant phonetic information in the high
frequencies, the distribution of transcription scores was surprisingly (or perhaps not so
surprisingly) variable. The range of percent correct scores for transcription of CVs by the
male speaker was 25% to 85% correct (see Figure 6).
Figure 5. Waveform recordings and corresponding spectrograms for the sentence “the birch canoe slid on the smooth planks” for a) normal speech and b) high-pass filtered speech.
Figure 6. Transcription scores (percent correct word and consonant identification, 0–80%) for individual subjects, plotted by speaker voice (male vs. female). Data replotted from Vitela, Monson & Lotto (2015). Thick black line represents the mean.
Temporally Degraded Speech.
The majority of degraded speech examples, with the exception of dysarthric speech
perception, degrade some aspect of the spectral resolution of the signal. The temporal
resolution of the speech signal, or speech rate, can also be manipulated to serve as an
example of a challenging listening environment. Time-compressed speech is a common
manipulation of the speech signal that results in temporal degradation. While the
spectral fine structure is intact, the rate of presentation, or the time-varying amplitude
modulation of the signal, is highly degraded, as shown in Figure 7.
Figure 7. Waveform recordings and corresponding spectrograms for the target word “goat” for a) normal speech and b) time-compressed speech by 30 percent of the original signal length.
Vaughan and Letowski (2015) examined whether effects of aging could be reflected
in temporal aspects of auditory processing. Fifty-four participants in three different age groups
(Group 1 M = 28.2 years of age; Group 2 M = 48.4 years of age; Group 3 M = 69.1 years
of age), all with normal hearing (as per pure-tone audiometric screening) were presented
various speech intelligibility tests at different time-compressed rates. The study consisted
of four types of speech tasks, all varying the rate of speech compression by 60%, 70%
and 80%. While the older group (Group 3) did perform more poorly on all tests of
speech intelligibility regardless of compression rate, perhaps more interesting is the
amount of variability noted in this group’s speech intelligibility rating test at 70% time-compression. Average intelligibility rating was 28.4% with a standard deviation of
21.1%. These demonstrations of large variability in performance for normal hearing
listeners suggest that there are inherent individual differences in auditory/cognitive
processing that manifest themselves when speech perception becomes challenging.
It is tempting to speculate that these differences in degraded speech processing may
also account for some of the immense variability seen in hearing-impaired populations
(Watson, 1991; Surprenant & Watson, 2001). That is, basic cognitive/perceptual
differences that become evident in acute degraded speech conditions for NH listeners
may also underlie the performance differences seen for HI listeners who are chronically
dealing with a degraded signal. However, to date, there has been little success in
determining which differences in auditory/cognitive processes are responsible for this
variability.
1.2 Attempts to account for the variability in degraded speech perception
performance
One of the most obvious places to look for differences underlying variability in
speech performance is in individual performance on basic psychoacoustic tasks.
However, several previous studies examining individual differences in NH listeners’
spectral and temporal discrimination abilities for a wide variety of non-speech sounds
have failed to uncover strong correlations between psychoacoustic abilities (or measures
of peripheral encoding) and speech perception performance. For example, Espinoza-Varas and Watson (1988) attempted to assess the commonality between speech
perception and auditory discrimination abilities. Using the Test of Basic Auditory
Capabilities (TBAC) developed by Watson and colleagues (1982), the authors compared
listeners’ speech and nonspeech auditory abilities from the eight subtests that comprise
the TBAC. These included syllable identification in noise; discrimination of changes in
frequency, intensity, and duration; discrimination of “jitter” within a pulse train;
discrimination of temporal order for sinusoids and nonsense syllables; and
detection of single-component changes in word-length tonal patterns. Even with the large
subject population (n=236) included in this study, inter-test correlations revealed nonsignificant relationships between NH listeners’ abilities to detect fine acoustic details of
nonspeech sounds and their skill at identifying speech sounds.
Similarly, Surprenant and Watson (2001) attempted to uncover the relationship (or
lack thereof) between basic auditory abilities and speech perception. They recruited 93
NH adult listeners between the ages of 18 and 32 years. Participants were presented the
TBAC (Watson et al., 1982). In addition, the speech perception portion of the study
consisted of three tasks: CV identification, word recognition, and sentence comprehension.
CVs were presented in 2 different noise levels (0 and +3 dB SNR); Words were presented
in 2 different noise levels (0 and -3 dB SNR); Sentences ranging in length from 2 to 13
words, were presented in five different levels of noise (+3, 0, -3, -4, and -5 dB SNR).
Results revealed very weak correlations between spectral and temporal nonspeech
auditory abilities and speech processing measures. The authors, however, were hesitant to
conclude that individual differences in basic auditory abilities have little to do with one’s
ability to perceive speech. They suggest that because the task of perceiving speech, from
a cognitive standpoint, is such a complex skill, perhaps these tests of nonspeech
psychoacoustic abilities are not sensitive enough to capture the complexity of the speech
signal. However, when the “basic auditory task” is elevated to include more complex
stimuli, as in speech context effects, there is still no evidence that such measures are predictive of
individual variability in degraded speech perception. Choe et al. (2009) for example,
explored the correlations between NH perception of dysarthric speech and phonetic
context effects that are presumed to be related to general auditory processes. These
context effects included duration contrasts (effects of vowel length on preceding /b/-/w/
categorization) and spectral contrasts (effects of preceding liquids on /d/-/g/
categorization). No significant correlations were obtained between dysarthric speech
transcription ability and performance on phonetic context effects tasks, which again
raises doubt that individual differences in basic auditory encoding/processing underlies
the variability in degraded speech performance.
Given the lack of predictive power offered by measures from psychoacoustics, one
begins to suspect that cognitive differences may be culpable for the variability in speech
performance. There is a substantial recent literature examining the role of cognitive
decline in the decreased performance of older adults in speech in noise tasks (e.g.,
Wingfield, 1996; Akeroyd, 2008; Schoof & Rosen, 2014; Besser, Festen, Goverts,
Kramer, & Pichora-Fuller, 2015; Füllgrabe, Moore, & Stone, 2015). The general
conclusions of this literature are that cognitive factors clearly play some role in the
differences between older and younger adults in performance, but the exact measures that
provide the best predictions are unclear. Akeroyd (2008) reviewed 20 studies on age
differences in speech in noise performance and found that no single measure was
consistently related to outcome. Füllgrabe et al. (2015) found that a composite cognitive
measure provided some predictive power but that single measures were weakly correlated
with performance, and Schoof & Rosen (2014) showed no correlation between speech in
noise perception and measures of working memory or processing speed. In their study
with dysarthric speech and young adult NH listeners, Choe et al. (2009) found a weak
correlation (r = 0.27) between a measure of working memory/executive control and
performance on speech transcriptions.
The lack of significant predictors for speech performance may be due to two
possibilities. The first is that the variability in outcomes in these tasks does not represent
stable individual differences. It is possible that the variability in performance is random
variation or measurement error that does not relate to intrinsic differences in
perceptual/cognitive processing. Most studies of degraded speech perception only
measure performance on a single speech task presented once. Thus, it is hard to gauge if
the variability observed is robust across tasks or across multiple repetitions of the same
task. The second possibility is that the general auditory and cognitive measures that have
been used previously are too coarse to measure the differences in the specific processing
that is required to correctly perceive degraded speech.
Why is degraded speech so difficult? One reason is that the cues that are most
relevant for speech sound categorization in quiet become less reliable when degraded. To
correctly perceive degraded speech a listener may need to rely on secondary cues or to reweight their use of cues to match the reliability of these cues in the degraded condition –
they must adapt their perception. McGettigan et al. (2008) came to a similar conclusion
in their study on individual differences in learning noise-vocoded speech. They suggest
that the increase in performance across training sessions with this degraded input was due
to the increased use (or increased weighting) of cues that are relatively preserved in
noise-vocoded speech such as duration and voicing.
In the McGettigan et al. (2008) study, the participants who performed the best in this
degraded speech task were likely those who were more adaptive at making this switch.
Such perceptually adaptive behavior may not be effectively measured by psychoacoustic
tasks of spectral/temporal resolution or by general cognitive tests of working memory
capacity or IQ. The essential importance of perceptual flexibility for degraded speech
performance may explain why Crandell (1991) and Plomp (1986) found no correlation
between speech in quiet and speech in noise performance among hearing aid users. Also recall
the large variability noted in the intelligibility rating test for older adults with normal
hearing mentioned previously where the mean intelligibility rating was 28.4% with a
standard deviation of 21.1%. Here the authors speculate that perhaps this variability could
be accounted for by some measure of cognitive or language ability (Vaughan &
Letowski, 2015). Perhaps if we could explicitly measure perceptual adaptability to
changes in the reliability of particular cues in speech categorization, it would provide a
greater predictor of degraded speech performance than the general measures that have
been used to this point.
Another concern is the reliability of these individual differences in degraded speech
perception ability. There is little evidence whether this variability is a result of true, stable
individual differences and not just variability arising from measurement error. If this is a
true difference in performance variability, then it should be stable across time, and one
would expect that there would be some correlation in performance across different types
of degradation. This is similar to the notion of the g factor in the measure of intelligence
– a general measure that correlates positively with other tasks of a similar nature. This is
an important notion because it attempts to explain these individual differences, presuming
that there are in fact stable differences in performance.
1.3 Proposal
In order to make advances in understanding the causes of degraded speech perception
variability (with implications for variability in outcomes for hearing impaired
populations), the project described below will attempt to establish if degraded speech
perception variability is the result of reliable individual differences and whether these
differences can be predicted to some extent by an individual’s ability to adapt their cue
weighting strategy in well-controlled speech tasks. There are three specific aims of the
project:
Specific Aim #1: Establish whether individual differences in normal hearing listeners’
ability to perceive degraded speech are reliable across multiple sessions and consistent
across a variety of degraded speech tasks.
Specific Aim #2: Establish whether individual differences in normal hearing listeners’
cue weighting adaptation in categorization tasks is reliable across sessions and consistent
across multiple speech categorization contrasts.
Specific Aim #3: Determine the extent to which individual differences in degraded
speech perception can be explained by measures of listener cue weighting flexibility.
CHAPTER 2
Reliability of individual differences in degraded speech perception
2.1 Introduction
The first goal of this project was to determine whether individual differences in
normal hearing (NH) listeners’ speech perception abilities are true, reliable differences.
Three different types of degraded speech were used in this study for the purpose of
presenting a variety of degradation. The first, noise-vocoded speech, as stated previously,
is a type of degradation used to simulate speech through a cochlear implant. This type of
degradation maintains the temporal (or rhythmic) information in the signal while
degrading fine-grained spectral information by removing harmonics and degrading
formant peaks. The second, time-compressed speech, which emulates rapidly produced
speech, will serve as a form of temporal degradation in which the signal (in this case, a
monosyllabic word) is compressed by a percentage of the original length. The third form
of degradation is speech in babble noise, in which monosyllabic words are embedded in
one second of multi-talker babble. The purpose of including this type of degradation is
that it is a more common degradation in the literature on hearing impairment and age-related difficulties with speech perception.
Together, these three types of degraded speech provide a good test of cross-task
individual differences, allowing us to examine whether performance variability is specific
to a type of degradation. Each task includes a different challenge for the listener –
resolving the temporal structure of the speech stimulus (time-compressed speech),
resolving the spectral information (noise-vocoded speech), or segregating the target
speech stimulus amid competing speech streams (speech in babble noise).
2.2 Experiment 1 – Degraded Speech Perception
Specific Aim #1: Establish whether individual differences in normal hearing listeners’
ability to perceive degraded speech are reliable across multiple sessions and consistent
across a variety of degraded speech tasks.
The purpose of Experiment 1 was to test the hypothesis that the individual
variability that arises under challenging listening conditions for NH listeners is reliable. If
this is the case, individual differences in performance should persist across various
degraded speech tasks and across multiple sessions. In particular, strong correlations
between performance on the same tasks across testing sessions are a minimum requirement for
declaring robust individual differences. Significant correlations across tasks would
suggest that some perceptual-cognitive ability is predictive of performance even when the
specific degradation challenges differ. This is similar to the general intelligence (g)
factor that Spearman (1904) proposed based on correlations across mental tasks.
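To make the planned reliability analysis concrete, the short Python sketch below computes the kind of test-retest correlation that Specific Aim #1 calls for. It is an illustration only: the listener scores in the arrays are invented and do not correspond to any data reported in this chapter.

    import numpy as np
    from scipy.stats import pearsonr

    # Hypothetical word recognition scores (% correct) for eight listeners
    # on the same degraded speech task in Session 1 and Session 2.
    session1 = np.array([52.0, 47.5, 61.0, 38.5, 55.0, 49.0, 63.5, 44.0])
    session2 = np.array([50.5, 45.0, 64.0, 41.0, 53.5, 51.0, 60.0, 42.5])

    # Test-retest reliability: correlation of the same task across sessions.
    r, p = pearsonr(session1, session2)
    print(f"Session 1 vs. Session 2: r = {r:.2f}, p = {p:.3f}")

    # Cross-task consistency would be computed the same way, correlating
    # scores from two different degradation types within a single session.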
2.2.1 Methods
Stimuli
A list of 300 monosyllabic words was selected as the target words for this study. The
words were selected on the basis of word frequency and variety of vowels and
consonants. The goal was to sample a large number of consonant-vowel-consonant
combinations. All 300 words were recorded by a native English male speaker in a sound-proof booth using a Sony PCM-M10 solid-state recorder with a Shure FP23 preamplifier.
The microphone used was a Countryman E6 omnidirectional head-mounted microphone,
with a 0 dB flat cap. All recordings were sampled at 44.1 kHz, 16-bit. The speaker was
instructed to produce each word three times. The recorded word with the clearest
articulation of the three was selected for each token. The list was randomized through
an online list randomizer (www.random.org) and divided into three separate lists of 100
words each (See Appendix A for the complete word list). Each list was digitally
manipulated to present three types of speech degradation: noise-vocoded speech,
compressed speech, and speech in babble noise.
Noise-vocoded Speech
Noise-vocoded speech stimuli were created by first dividing the speech signal into
8 separate frequency bands; this is similar to the individual electrodes in a cochlear
implant (i.e. each frequency band is analogous to a single CI electrode). The amplitude
envelope of each frequency band is applied to white noise, filtered to correspond to each
frequency band. This process is delineated with spectrograms and an example sentence in
Figure 8. The result of this process is a stimulus that sounds like a harsh whisper, or as
one would imagine Darth Vader would sound like with a sore throat. The 8-channel
noise-vocoded stimuli in this study were created using a script in PRAAT (Boersma &
Weenink, 2005). The words were first filtered into eight logarithmically-spaced
frequency bands with the 3-dB-down cutoff values at 100, 473, 713, 1046, 1509, 2152,
3043, 4281, and 6000 Hz. These values were equivalent to the 8-band processed speech
of Fu, Shannon & Wang (1998) and represent similar values to what would be used in an
8-channel Continuously Interleaved Sampling (CIS) CI processing strategy. The amplitude
envelope for each channel was then extracted by full-wave rectification (taking the
absolute value of each sample), followed by low-pass filtering with a 50-Hz cutoff. Gaussian
(white) noise was then filtered by the same bank of eight filters and then modulated with
the corresponding amplitude envelope extracted from the speech. These modulated noise
bursts were then recombined into a single mono signal.
Figure 8. Steps involved in converting clear speech (left spectrogram) into noise-vocoded speech (right spectrogram). The word is filtered into 8 frequency bands (Step
1) and amplitude envelopes of each band are extracted (Step 2). Band-pass noise is
modulated from extracted amplitude envelopes (Step 3), and finally, all frequency
bands are added together (Step 4).
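For readers who prefer a procedural summary, the sketch below expresses the same vocoding steps in Python with NumPy and SciPy. It is an illustrative approximation of the process described above, not the PRAAT script that was actually used; the Butterworth filter order, zero-phase filtering, the output normalization, and the file name in the usage comment are assumptions.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    # 3-dB-down band edges (Hz) for the eight analysis channels, from the text.
    EDGES = [100, 473, 713, 1046, 1509, 2152, 3043, 4281, 6000]

    def noise_vocode(speech, fs, edges=EDGES, env_cutoff=50.0):
        """Return an 8-channel noise-vocoded version of `speech` (1-D float array)."""
        noise = np.random.default_rng(0).standard_normal(len(speech))  # white-noise carrier
        env_lp = butter(4, env_cutoff, btype="low", fs=fs, output="sos")
        out = np.zeros(len(speech))
        for lo, hi in zip(edges[:-1], edges[1:]):
            band = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
            speech_band = sosfiltfilt(band, speech)              # band-limit the speech
            envelope = sosfiltfilt(env_lp, np.abs(speech_band))  # rectify, then 50-Hz low-pass
            envelope = np.clip(envelope, 0.0, None)              # low-pass filtering can undershoot zero
            noise_band = sosfiltfilt(band, noise)                # band-limit the noise carrier
            out += envelope * noise_band                         # modulate and sum across channels
        return out / np.max(np.abs(out))                         # normalize to avoid clipping

    # Usage (file name hypothetical):
    # from scipy.io import wavfile
    # fs, word = wavfile.read("goat.wav")
    # vocoded = noise_vocode(word.astype(float), fs)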
Time-compressed Speech
The words in each list were time-compressed by a factor of 3, using the pitch-synchronous overlap-and-add (PSOLA) procedure (Moulines & Charpentier, 1990)
incorporated into PRAAT (Boersma & Weenink, 2005). This procedure allows the
spectral properties of the signal (e.g. formant patterns) to be compressed in
duration while the pitch (fundamental frequency) contour remains the same. While this
form of compression is somewhat different from naturally produced rapid speech (as it is
a linear compression), there is evidence to suggest that listeners find the linearly time-compressed speech more agreeable than natural-fast time-compression (Janse et al., 2003;
Janse, 2004). This is important for the current work as it suggests a lesser listening effort
for linearly time-compressed speech. Figure 7 depicts a spectrogram and waveform
example of a time-compressed stimulus from the current study.
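The compression itself was carried out in PRAAT. Purely as an illustration, the sketch below performs an equivalent manipulation through the Parselmouth Python interface to Praat, using the "Lengthen (overlap-add)" command, which applies a PSOLA-style duration change (a factor below 1 shortens the signal while the F0 contour is preserved). The pitch-range arguments, the factor of 0.3, and the file names are assumptions rather than the parameters used in this study.

    import parselmouth
    from parselmouth.praat import call

    # Load a recorded word (file name hypothetical).
    sound = parselmouth.Sound("goat.wav")

    # "Lengthen (overlap-add)" arguments: minimum pitch (Hz), maximum pitch (Hz),
    # duration factor. A factor of 0.3 shortens the word to 30% of its original
    # duration without altering the fundamental frequency contour.
    compressed = call(sound, "Lengthen (overlap-add)", 75, 600, 0.3)

    compressed.save("goat_compressed.wav", "WAV")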
Speech in Babble Noise
Adobe Audition 1.5 (Adobe Systems, Inc., 2004) software was used to embed the
target words in multitalker babble. Stimuli were sampled at a rate of 44,100 Hz. Speech in
babble noise stimuli were created using a 4 second recording of 4-talker babble (2
female, 2 male). One-second samples were randomly selected from the babble noise and
spliced with the target words. The words were embedded 100 msec after the onset of
babble and a smoothing ramp was added to the first and last 10 msec of each stimulus.
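The stimuli themselves were assembled in Adobe Audition; the Python sketch below merely illustrates the same sequence of operations (a random one-second babble excerpt, scaling to the target SNR, a 100-msec word onset delay, and 10-msec ramps). The RMS-based definition of SNR and the linear ramp shape are assumptions, not a description of the Audition processing.

    import numpy as np

    FS = 44100  # sampling rate (Hz)

    def embed_word_in_babble(word, babble, snr_db=-3.0, onset_s=0.100, ramp_s=0.010, seed=0):
        """Embed `word` in a random 1-s excerpt of the longer `babble` recording at the target SNR."""
        rng = np.random.default_rng(seed)
        start = rng.integers(0, len(babble) - FS)
        mix = babble[start:start + FS].astype(float).copy()   # random one-second babble sample

        # Scale the babble so the word-to-babble RMS ratio matches the target SNR.
        word = word.astype(float)
        word_rms = np.sqrt(np.mean(word ** 2))
        babble_rms = np.sqrt(np.mean(mix ** 2))
        mix *= word_rms / (babble_rms * 10 ** (snr_db / 20.0))

        # Splice the word in 100 msec after babble onset.
        onset = int(onset_s * FS)
        end = min(onset + len(word), len(mix))
        mix[onset:end] += word[:end - onset]

        # 10-msec linear onset/offset ramps to smooth the stimulus edges.
        n = int(ramp_s * FS)
        ramp = np.linspace(0.0, 1.0, n)
        mix[:n] *= ramp
        mix[-n:] *= ramp[::-1]
        return mix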
2.2.2 Preliminary study: determining equal difficulty across tasks
A primary concern when examining individual differences across a variety of
speech tasks is the very likely possibility that not all tasks are created equally – equally
difficult, that is. The purpose of this preliminary study was to control for difficulty levels
across three degraded speech tasks (noise-vocoded speech, compressed speech and
speech in babble noise) by adjusting certain parameters in order to achieve an average of
50% words correct for each task. This preliminary work provides two benefits. First, it
equates tasks on relative difficulty allowing for more valid comparisons of performance
across tasks. Second, it increases the sensitivity of the tasks to individual differences,
given that ceiling or floor effects are less likely.
All three degraded speech conditions were presented to a group of 13 NH
listeners in order to determine roughly equivalent levels of difficulty across tasks. From
previous studies examining intelligibility of noise-vocoded speech, NH listeners’
performance with a 6-channel vocoder results in about 15-30% average words correct
(Carbonell & Lotto, 2013; Fu, Shannon & Wang, 1998). For this reason an 8-channel
vocoder was used for the preliminary test.
The initial speech compression rate was set to 30% of the original length of each
stimulus based on work by Ghitza and Greenberg (2009), examining intelligibility of
time-compressed speech. The authors suggest that a 30% compression rate results in
about 50% words correct in a task that presented listeners with full sentences. A 20%
compression rate was also included after the first several subjects scored well above the
50% criterion with the 30% compression rate.
The 50% recognition point for speech in babble noise has been reported at 4 dB
SNR for the Words-in-Noise (WIN) test (Wilson, 2003; Wilson et al., 2006), so the
preliminary study was conducted with 4-talker babble at an SNR of 4 dB. It became
clear after the first four participants (percent correct = 78.8%, 81.8%, 72.7%, and 74.2%)
that the SNR was not low enough to achieve 50% correct average performance.
Subsequent follow-up tests included a range of SNRs from -3 dB to 3 dB.
This preliminary investigation was aimed at determining the level of degradation
required for each task to elicit, on average, 50% correct words. Scores from each task fell
near the 50% criterion with the following parameters: 8-channel noise vocoding, 20%
compression rate, and -3 dB SNR.
2.3 Current Study
Participants
A total of 90 participants were recruited for this study, but 24 were excluded because
they were unable to complete all of the tasks across multiple sessions. Ultimately, 66
college-age adults (56 female, 10 male; age range: 18 – 40 years old) with self-reported
normal hearing1 and no history of speech or language deficits participated in this study.
All participants were native speakers of English. They participated either for class credit
or for compensation of 10 dollars per session.
1
No hearing screenings were performed on the participants in this study, as all stimuli were presented at
suprathreshold levels (75 dB SPL). However, it is possible that, despite reports of normal hearing, some
participants may have thresholds that fall above the 20 dB cut-off for normal hearing. That being said, the
extensive background literature (described in this work) demonstrating similar variability in performance
even when hearing screenings were administered provides evidence that this variability cannot merely
be due to small variances in hearing thresholds. There is also substantial evidence from this same body of
literature to suggest that variability in performance in degraded listening conditions goes beyond
measures of audibility. This was taken as evidence that any variability in hearing thresholds in this sample
would not significantly account for individual differences in degraded speech performance.
Procedure
Each participant performed a word recognition task with each degraded speech
type over 2 separate sessions. Session 1 consisted of two parts: 1) Degraded speech task
consisting of word recognition tasks corresponding to each type of degraded speech
stimuli (noise-vocoded, compressed, and speech in babble noise); 2) Working memory
task – shortened version of the Operation Span (OSPAN) task (Foster et al., 2014)
(discussed further in Chapter 4). Counterbalancing of word lists and task order was
achieved by assigning each participant a different order of task presentation during each
session and ensuring that they did not hear the same list of words twice for a single
degradation type. For example, Participant 1 received the following order of degraded
speech tasks and word lists during Session 1:
1) Noise-vocoded – List 1
2) Compressed – List 2
3) Babble Noise – List 3
During Session 2, she received the following order of tasks and word lists:
1) Noise-vocoded – List 2
2) Babble Noise – List 1
3) Compressed – List 3
Part 1- Degraded Speech Task: Subjects were seated in front of a computer monitor in a
sound-attenuated booth and asked to type the word that they heard for each trial. Stimuli
were presented through circumaural headphones (Sennheiser HD 280) at approximately
75 dB SPL. The degraded speech tasks were presented entirely through the ALVIN
software (Hillenbrand & Gayvert, 2003). Figure 9 depicts an example of the experiment
screen in ALVIN. Subjects were not given feedback and could not replay any of the
stimuli. The entire degraded speech task lasted approximately 30 minutes. Subjects were
allowed a break between Part 1 and Part 2 of Session 1.
Figure 9. Example of the experiment screen view for the Degraded Speech word
recognition task in ALVIN software (Hillenbrand & Gayvert, 2003).
Part 2 – Working Memory Task: Following the degraded speech task, subjects completed
a working memory span task. Working memory span was measured with a shortened
version of the modified operation span (OSPAN) task (Foster et al., 2014)
http://englelab.gatech.edu/tasks.html. The shortened OSPAN is completely automated in
E-prime software (Schneider et al., 2002), and lasts approximately 25 minutes from start
to finish. The task required participants to verify simple math equations while recalling
sequences of orthographic English letters. For example, the participant would first see a
simple equation (e.g., (4 x 2) – 7 = 4?) on the computer monitor and was required to judge
whether the equation was correct by selecting a YES or NO key on the keyboard within 6
seconds. Visual feedback (e.g., ‘Correct’, ‘Incorrect’, ‘Too late!’) was provided after each
trial. Immediately following the feedback, a single letter was displayed; subjects then
solved another math problem, after which they were shown another letter. This math–letter
sequence was repeated anywhere from three to seven times per trial, with an unpredictable
length each time. After each sequence, subjects were tasked with recalling, in the correct
order, each of the preceding letters. Figure 10 depicts
examples of the order of visual stimuli presented on the computer monitor for the
shortened OSPAN: equation → feedback → letter → equation → feedback → letter.
Figure 10. Example of the visual sequence of the
Operation Span Task, modeled after Foster et al.
(2014).
Follow-up Session
Each participant was scheduled to repeat the degraded speech task on a following
day (a minimum of 2 days between sessions was required to lessen the effects of
learning). During the follow-up session, participants were presented with the same
degraded speech types and asked to perform the same word recognition task from Session
1, only this time with a different task order and list order.
Scoring Procedure
The degraded speech word recognition tasks were scored as percent correct word
identification for each task (100 target words in each task). Participants' responses were
coded as either correct (1) or incorrect (0), and the total number of correct responses
determined the overall percent correct. Given that each response was manually typed by the
participant, homophones and common misspellings were coded as correct since they
evidence accurate perception of the intended sounds, e.g., the response "not" for the word
"knot", or the response "reep" for the word "reap". Working memory scores were
quantified by adding the number of letters correctly recalled in the correct order, known
as the ‘Partial Span Score’ (Turner & Engle, 1989). Responses to the math problems were
not included in the analysis, but all participants were required to maintain a level of at
least 80 percent correct on the math responses throughout the experiment. This ensured
that the math problems acted as true distractor stimuli.
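A compact sketch of the two scoring rules described above is given below: lenient word scoring against a list of accepted alternatives (the dictionary entries shown are only the two examples mentioned in the text) and the partial span score, which sums letters recalled in their correct serial positions.

```python
# Word recognition: each typed response scored 1/0, with homophones and
# common misspellings accepted for specific targets (illustrative entries only).
ACCEPTED = {"knot": {"knot", "not"}, "reap": {"reap", "reep"}}

def word_score(target, response):
    ok = ACCEPTED.get(target, {target})
    return int(response.strip().lower() in ok)

def percent_correct(targets, responses):
    return 100.0 * sum(word_score(t, r) for t, r in zip(targets, responses)) / len(targets)

# OSPAN partial span score: every letter recalled in its correct serial position
# counts, summed over all letter sequences (Turner & Engle, 1989).
def partial_span(presented_seqs, recalled_seqs):
    return sum(
        sum(p == r for p, r in zip(presented, recalled))
        for presented, recalled in zip(presented_seqs, recalled_seqs)
    )
```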
Statistical Analyses
Correlational analyses and linear regression models were computed across all
degraded speech types and both sessions. Overall means and standard deviations for each
task are reported in Table 1.
Table 1. Overall means and standard deviations for each degraded speech task for
Session 1 and Session 2 (indicated by day).

Degraded Speech Task      Mean % Correct Score    Std. Deviation
Babble Day 1              50.167                  8.1780
Babble Day 2              53.652                  8.0412
Compressed Day 1          66.636                  8.9471
Compressed Day 2          70.242                  9.6557
NV Day 1                  51.818                  9.1398
NV Day 2                  56.909                  8.7455
2.3.1 Results
Reliability across tasks – Session 1
Predictions from Experiment 1 stem from the hypothesis that individual
differences in perception of degraded speech are true, reliable differences, independent of
the type of degradation. Under this presumption, percent correct scores on the individual
word recognition tasks for noise-vocoded speech, speech in babble noise and compressed
speech should correlate significantly, and performance variability should also persist
across a variety of degraded speech tasks.
Results from comparisons between all three tasks in Session 1 reveal significant
correlations at the p <.01 level, suggesting reliability in performance across different
types of degraded speech tasks (see Table 2 for correlation values across all tasks for
Session 1).
Table 2. Correlational analyses across degraded speech tasks for Session 1.

Compared Tasks Day 1       Correlations
Babble and Compressed      r = .47, p < .001
Compressed and NV          r = .42, p < .001
NV and Babble              r = .49, p < .001
Reliability across tasks – Session 2
There were also significant correlations between scores on each task for Session 2
at the p <.01 level, indicating stable cross-task performance (see Table 3 for correlation
values across all tasks for Session 2).
Table 3. Correlational analyses across degraded speech tasks for Session 2.

Compared Tasks Day 2       Correlations
Babble and Compressed      r = .46, p < .001
Compressed and NV          r = .57, p < .001
NV and Babble              r = .41, p < .01
Reliability across tasks – averaged across both sessions
Correlational analyses were also computed across tasks with word recognition
scores averaged across days (see Table 4 for overall averages and standard deviations).
Speech in babble noise scores were strongly correlated with compressed speech scores,
r(64) = .59, p < .001, and noise-vocoded speech scores, r(64) = .53, p < .001. Noise-vocoded
speech scores were also strongly correlated with compressed speech scores,
r(64) = .58, p < .001. A linear regression model was used to test the predictive power of
these relationships. When two tasks are used together to predict the average score for the
remaining task, each combination is about equally predictive: compressed and
noise-vocoded, r = .63; compressed and babble noise, r = .63; noise-vocoded and babble
noise, r = .67. See Figures 11-13 for individual subject data on each task averaged across
sessions.
Table 4. Overall means and standard deviations for word recognition scores on each
degraded speech task averaged across both sessions.

Degraded Speech Task    Mean % Correct Score    Std. Deviation
Babble Average          51.909                  6.7173
Compressed Average      68.439                  8.6289
NV Average              54.364                  8.4551
Figure 11. Individual data for word recognition scores on the 8-channel noise-vocoded
speech task.
Figure 12. Individual data for word recognition scores on the time-compressed (by
30%) speech task.
Figure 13. Individual data for word recognition scores on the speech in
babble noise (-3 SNR) task.
Reliability across sessions
In order to test the reliability of scores across days, a linear regression model was
created for each degraded speech task across the two sessions, using days in between
sessions as a covariate. Speech in babble noise scores on day 1 significantly predicted
speech in babble noise scores on day 2, β = .37, t(64) = 3.17, p < .01; scores on day 1
were also predictive of scores on day 2 for compressed speech, β = .74, t(64) = 8.67,
p < .001, and noise-vocoded speech, β = .79, t(64) = 10.14, p < .001.
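A sketch of one such test-retest regression is shown below using statsmodels' formula interface; it assumes a wide-format table with one row per participant and hypothetical column names (babble1, babble2, days_between). Standardizing the score columns beforehand yields slopes comparable to the standardized β values reported here.

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per participant; file and column names are illustrative.
df = pd.read_csv("degraded_speech_scores.csv")

# z-score the two score columns so the day-1 slope is a standardized beta.
df[["babble1_z", "babble2_z"]] = df[["babble1", "babble2"]].apply(
    lambda col: (col - col.mean()) / col.std()
)

# Day-1 score predicting day-2 score, with days between sessions as a covariate.
model = smf.ols("babble2_z ~ babble1_z + days_between", data=df).fit()
print(model.summary())
```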
Reliability across sessions – average scores across all tasks
Average word recognition scores were computed across all tasks for each
session, resulting in two overall degraded speech scores for Session 1 and Session 2.
Average scores on Session 1 strongly predicted the average scores on Session 2,
β = .80, t(64) = 10.75, p < .001, with number of days between sessions included as a
covariate. Significant reliability was found between scores for each task across days. The
average-measures Intraclass Correlation Coefficient (ICC) for Speech in Babble noise
scores was .542 with a 95% confidence interval from .253 to .720, F(65) = 2.185, p < .05.
The ICC for Compressed speech scores was .836 with a 95% confidence interval from .733
to .900, F(65) = 6.111, p < .001. The ICC for Noise-vocoded speech scores was .881 with a
95% confidence interval from .805 to .927, F(65) = 8.390, p < .001. Figures 14-16 show
individual scores for each task on Session 1 plotted against scores on Session 2.
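The average-measures ICCs reported above can be reproduced with a standard routine such as pingouin's intraclass_corr; the sketch assumes a long-format table with subject, session, and score columns (file and column names hypothetical).

```python
import pandas as pd
import pingouin as pg

# Long format: one row per subject x session for a single task.
long_df = pd.read_csv("babble_long.csv")          # hypothetical file

icc = pg.intraclass_corr(data=long_df, targets="subject",
                         raters="session", ratings="score")
# The average-measures rows (ICC2k/ICC3k) correspond to the values reported here.
print(icc[icc["Type"].isin(["ICC2k", "ICC3k"])])
```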
Figure 14. Scatterplot of Speech in Babble noise scores on Session
1 (Babble1) and Session 2 (Babble2).
Figure 15. Scatterplot of individual scores on Time-Compressed
Speech for Session 1 (Comp1) and Session 2 (Comp2).
Figure 16. Scatterplot of individual scores on Noise-Vocoded Speech
for Session 1 (NV1) and Session 2 (NV2).
2.3.2 Interim summary and discussion
Three different types of degraded speech were presented to NH listeners for the
purpose of examining whether individual differences in performance ability are
consistent across a variety of degradation types. Participants were also asked to return a
second day to repeat the degraded word recognition tasks in order to test consistency in
their performance across time. Results from Experiment 1 suggest that the individual
variability in NH listener performance on degraded speech tasks is reliable not only
across multiple days, but also across tasks. Normal hearing listeners performed a word
recognition task on three different types of degraded speech – noise-vocoded,
compressed, and speech in babble noise. Participants’ scores on all three tasks were
strongly correlated, suggesting that individual performance is consistent across different
types of degradation.
All subjects returned for a follow-up session during which they were administered
the same degraded speech tasks. Both individual task scores and average scores across
tasks were strongly correlated, indicating strong test-retest reliability. In particular,
performance on each task averaged across the two days was well predicted (r values
between 0.63 and 0.67) by performance on the other two tasks.
The current experiment provides strong evidence for inherent variability in
performance that exists among normal hearing listeners when listening conditions
become challenging. Now that these individual differences appear stable and reliable, the
question becomes: what is driving this variability? The next chapter seeks to address this
question by attempting to determine the source of individual differences as a measurable
ability: perceptual flexibility.
CHAPTER 3
Reliability of individual differences in cue weighting adaptation
3.1 Introduction
One of the main reasons why the task of understanding degraded speech is so difficult
is that the acoustic cues most relevant for speech sound categorization in quiet are much
less reliable when the signal is degraded. In order to accurately perceive degraded speech,
a listener will likely have to rely on secondary cues or change their weighting of cues in
order to adapt their perception. For example, the degradation of spectral information from
noise-vocoding, which would lower the reliability of important acoustic cues such as
formants and fundamental frequencies, may be accommodated by relying more on
redundant temporal cues, such as duration. Likewise, temporal degradation that may
occur from time compression may be overcome by relying more on preserved spectral
cues. The ability of some listeners to accurately perceive speech high-pass filtered above
5 kHz (Vitela et al., 2015) must be a result of the use of high-frequency cues that would
be secondary to the robust phonetic cues available in the normal signal below 5 kHz.
Given that shifting cue use appears to be an essential task for accurately perceiving
degraded speech, it is worth speculating whether there are reliable measurable individual
differences in adaptability of cue weighting that may account for the individual
differences in degraded speech perception performance demonstrated in Chapter 2.
The experiments discussed in this chapter were designed to test a measure of cue
weighting adaptability based on a recently proposed empirical paradigm that claims to be
a demonstration of perceptual flexibility. In order to serve as a potential predictor of
degraded speech perception performance, there will need to be sufficient inter-subject
variance on this measure and individual scores will have to be stable across multiple
sessions. Two different tasks will be used – one that requires "up-weighting" of a spectral
cue (adapting to rely more on a spectral cue) and the other requiring "up-weighting" of a
temporal cue. The use of these two tasks will provide the opportunity to
look for generalized perceptual adaptiveness (correlation in performance across tasks
akin to Spearman’s g) as well as adaptation abilities that are specific to temporal and
spectral cues (independent variation akin to Spearman’s s scores). To the extent that
there are reliable specific adaptation abilities, performance on these tasks may
differentially predict outcomes with degraded speech depending on the particular
challenges of the form of degradation.
3.1.1 Cue Weighting
For any phonemic contrast, there are likely to be a number of acoustic dimensions
which can be used to distinguish the contrast – that is, to serve as cues for phonemic
identity. For example, Lisker (1986) reported as many as 16 different acoustic
distinctions between English /b/ and /p/ when produced inter-vocalically. Typically,
none of these acoustic cues are sufficient to distinguish phonemes across all speakers and
phonetic contexts. This is the “lack of invariance” problem that has plagued research in
speech perception since the 1960s (e.g., Liberman et al., 1967). The lack of any reliable
defining acoustic cue becomes even more obvious when one examines “spontaneous” or
“reduced” speech (e.g., Warner, Fountain & Tucker, 2009; Ernestus & Warner, 2011;
Warner & Tucker, 2011). On the other hand, it is not the case that all of the imperfect
cues to a contrast are equally informative. The voicing contrast between /b/ and /p/ in
English is distinguished partly by a difference in Voice Onset Time (VOT; shorter in /b/,
Lisker & Abramson, 1964) and also by a difference in the fundamental frequency (f0) of
the following vowel (lower in /b/, House & Fairbanks, 1953; Lehiste & Peterson, 1961;
Hombert, 1978; Kingston & Diehl, 1994). However, VOT is a much more reliable cue
for distinguishing these phonemes in English. A competent listener of English would be
best served by relying more on VOT than on f0 when making a judgement about the
identity of a segment as /b/ or /p/. In other words, the listener may integrate both pieces
of acoustic information in arriving at a phonemic category, but they should weight the
VOT cue more heavily than f0 when coming to a decision. This is, in fact, what native
English listeners do (e.g., Francis et al., 2008).
The concept of cue weighting in perceptual decisions goes back at least as far as
Brunswik’s (1947; 1955) “Lens Model”. Brunswik proposed that visual perception of
depth or object size was dependent on a number of imperfect cues that varied in their
ecological validity (Brunswik, 1955; Postman & Tolman, 1959). Optimal perceivers
would weight these various cues in arriving at a final decision with the weights reflecting
the actual validity or reliability of each cue in the physical world.
In terms of audition, there has been quite a bit of work in psychoacoustics on
measuring the weighting of acoustic cues by correlating listener responses with
physical measurements for randomly varying acoustic dimensions (Berg, 1989; Lutfi,
1989; 1995). This approach has been used to provide information about cue use in
perception of loudness (Doherty & Lutfi, 1996; Kortekaas, Buus & Florentine, 2003), of
acoustic consequences of the material and size of struck bars and plates (Lutfi & Oh,
1997; Lutfi & Stoelinga, 2010), of auditory motion (Lutfi & Wang, 1999), of pitch of
harmonic complexes (Dai, 2000), and of spectral shape (e.g., Dai & Berg, 1992).
As applied to the perception of phonemic contrasts, the discussion of cue
weighting initially focused on developmental differences in speech sound perception
(Walley & Carrell, 1983; Nittrouer & Studdert-Kennedy, 1987; Nittrouer, 1996; Nittrouer,
Crowther & Miller, 1998; Mayo & Turk, 2004). These studies demonstrated through
classic syllable categorization experiments that children (typically under the age of
seven) weight acoustic cues to some speech contrasts differently than do adults (e.g.,
Hazan & Barrett, 2000; Mayo & Turk, 2004; Nittrouer, 2004). More recently, there has
been a growing interest in describing the differences between native (L1) and second
language (L2) learners' production and perception of phonemic contrasts in a cue
weighting framework (e.g., Iverson et al. 2003; Lotto, Sato & Diehl, 2004; Francis,
Ciocca, Ma & Fenn, 2008; Escudero, Benders & Lipski, 2009; Lim & Holt, 2011;
Schertz, Cho, Lotto, & Warner, 2015). For example, the notorious difficulty that native
Japanese learners have with English /l/ and /r/ (Miyawaki et al., 1975) may be in part due
to an improper weighting for the onset frequency of F3 (dominant in English) and F2
(secondary in English). Previous studies of L1-L2 cue weighting have shown that indeed,
L1 Japanese speakers will not weight F3 and F2 cues properly in order to perceive this /r/
- /l/ contrast (Iverson et al., 2003) – a strategy that also influences their production of the
L2 contrast (Lotto et al., 2004).
Holt and Lotto (2006) provided a framework for the examination of cue weighting
in speech and non-speech perception based on some of the same logic underlying the
models developed for psychoacoustic studies. They proposed that one can measure the
relative cue weights for acoustic cues to a category by systematically varying the values
on each acoustic dimension and then correlating category responses with the values on
each dimension (and then normalizing these correlations to sum to one). To demonstrate
the utility of this model, they asked NH listeners to categorize frequency-modulated tones
on the basis of two independent dimensions: center frequency (CF, perceptually related to
pitch) and modulation frequency (MF, perceptually related to “roughness”). Each
dimension was equally informative for the task. Listeners were trained with feedback (via
illumination of a light above the button corresponding to the correct response) on the
categorization of two predefined tone distributions. While the optimal strategy for this
task would be to weight both dimensions equally, listeners tended to weight the CF
dimension more heavily. For the purposes of the current study, one of the interesting
outcomes of the Holt and Lotto (2006) study was that there were substantial individual
differences in the measured cue weights across the set of NH participants. Figure 17 is a
novel replotting of the relative cue weight for the CF dimension for each listener based
on raw data from the Holt & Lotto (2006) study. As can be seen, some listeners actually
weighted the two dimensions optimally (giving them equal weight), whereas other
listeners were relying heavily on the CF cue.
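The cue-weight computation summarized above can be written out in a few lines: correlate the binary category responses with the stimulus values on each dimension and normalize the absolute correlations so that they sum to one. The sketch below uses made-up trial data purely for illustration; it is not the Holt and Lotto (2006) analysis code.

```python
import numpy as np

def relative_cue_weights(responses, cf_values, mf_values):
    """Correlate 0/1 category responses with each dimension; normalize to sum to 1."""
    r_cf = abs(np.corrcoef(responses, cf_values)[0, 1])
    r_mf = abs(np.corrcoef(responses, mf_values)[0, 1])
    total = r_cf + r_mf
    return r_cf / total, r_mf / total

# Illustrative trial-level data: one entry per categorization trial.
responses = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])                 # category B = 1
cf = np.array([300, 320, 480, 500, 490, 310, 470, 330, 510, 495])    # center frequency (Hz)
mf = np.array([40, 55, 42, 60, 58, 44, 57, 61, 43, 59])              # modulation frequency (Hz)

w_cf, w_mf = relative_cue_weights(responses, cf, mf)
print(f"CF weight = {w_cf:.2f}, MF weight = {w_mf:.2f}")
```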
Figure 17. Re-plot of Center Frequency (CF) weighting for individual subjects
in tone categorization from Holt & Lotto (2006).
This new analysis of the Holt and Lotto (2006) data provides some hope that there
are robust differences in cue weighting strategies between NH individuals. It is possible
that such differences could underlie the differences in performance on degraded speech
perception tasks. Of course, in order to account for the correlated performance across
degradation types, one cannot rely on stable differences in cue weights but on the ability
to adapt those weights to be near optimal based on the particular listening challenge.
3.1.2 Adaptability
The development of appropriate cue weights for one’s native language is likely
the result of experience with a large number of speakers. However, these “average”
weights are likely to be wrong in any specific listening condition. In particular, when
speaking to someone with a non-native accent, the “average” weights may lead to
incorrect phoneme categorizations. To be a robust communicator, one may need to adapt
those weights to a particular talker or listening condition. As discussed in the
introduction of this chapter, the listening challenges provided by different degraded
speech tasks may be described in terms of changing one’s cue weights to “down-weight”
unreliable acoustic dimensions and “up-weight” more reliable (less degraded) acoustic
dimensions.
As an example of this form of adaptability, Azadpour and Balaban (2015)
examined the mechanisms underlying perceptual adaption to degraded speech in the form
of severe spectral distortion – spectral rotation. The goal of this study was to test the
plausibility of three different possible explanations underlying perceptual adaptation. The
first hypothesized that the listener would perform a phonemic remapping of the degraded
speech sounds; the second suggested an experience-dependent generation of an inverse
transform – much like inverted visual category remapping (Held, 1965); the third
possibility was that listeners would change or adjust their cue weights (a shifting of
perceptual emphasis on acoustic cues that are least affected by spectral-rotation). The
spectrally rotated speech in this study was created by inverting the speech spectrum,
which severely distorts the spectral content but preserves temporal information. Twelve
adult, NH listeners were divided into 6 pairs, so as to serve as conversation partners
throughout the length of the experiment. Each pair participated in five 1-hour training
sessions where they were tasked with extracting spoken messages in spectrally-rotated
speech of their training partner (in separate rooms) with the goal of continuing a
conversation as naturally as possible. During the first training block, each pair was
allowed feedback of the spectrally rotated speech – a feature that was disabled in future
sessions. Each participant was given a set of perceptual tests before the first training
session and after the final training session in order to evaluate perceptual learning of the
degraded speech. The tasks consisted of sentence comprehension, phoneme identification
and syllable discrimination (all stimuli consisted of spectrally-rotated speech from their
training partner). Significant learning was observed between the pre-training and
post-training tests. Results also revealed that the most likely strategy underlying
perceptual adaptation to degraded speech is cue-weighting. That is, the listeners placed
more weight or emphasis on the acoustic phonetic information in the signal that was least
affected by the distortion.
If adaptation of cue weighting is an essential process in correctly perceiving
degraded speech, as suggested by Azadpour and Balaban (2015), then individual
differences in the ability to adapt cue weights should presumably be predictive of the
ability to perceive degraded speech. Unfortunately, there are no current measures of
adaptability of cue weighting for L1 contrasts and, thus, no way to know whether there
are consistent individual differences in cue weighting adaptation that could predict
degraded speech perception performance.
A new paradigm for examining adaptation in cue weighting provides some hope
of obtaining a useful individual difference measure. Idemaru and Holt (2011, 2014)
played to participants the spoken words beer and pier and asked them to respond by
clicking on the appropriate pictures on a screen with no feedback. The acoustic stimuli
varied both in VOT and in f0 of the vowel. During the initial block, VOT and f0 were
correlated as they are naturally for English – high f0 associated with long VOT (the
Canonical block). In a second block, the correlation between VOT and f0 were reversed
– low f0 associated with long VOT (the Reverse block). Both blocks included test stimuli
that were ambiguous with regard to VOT but were either high or low in f0 (there were no
distinctions between test trials and non-test trials). By examining the difference in /p/
responses to the high and low f0 test stimuli one can determine how much f0 is being
weighted for this distinction. During the Canonical block, listeners responded /p/ much
more for the high f0 than for the low f0 demonstrating that this cue was being used to
discriminate the contrast for these listeners. On the other hand, during the Reversal
block, participants responded statistically equivalently to the high and low f0 test stimuli,
suggesting that they no longer weighted f0 in their /b/ versus /p/ categorizations. This
shift in cue weighting occurs without feedback or any indication that there is a difference
between the first and second block.
Recently Schertz et al. (in press) adapted this paradigm to examine cue weight
adaptation for L2 learners. They asked native Korean speakers learning English as a
second language to perform a similar task of categorizing English /b/ and /p/ except that
there was no picture display. Participants categorized syllables as “ba” and “pa” by
clicking on an appropriately labeled button after hearing the syllable in the carrier phrase
“I say…”. One of the most salient results of this experiment was that these L2 learners
varied widely in their initial cue weights for VOT and f0 and their patterns of adaptation
also differed as a function of those initial cue weights. Whereas this study provides
evidence that performance on this adaptation paradigm may be influenced by individual
differences, the differences that were seen were due to different initial weights and were
for an L2 contrast. However, the experimental paradigm does appear to tap into some of
the processes that would be important for L1 degraded speech perception - in particular,
the shifting of cue weights on the basis of changing reliabilities of acoustic cues even
without external feedback.
The purpose of Experiment 2 is to examine whether this paradigm provides the
basis for an evaluation of individual differences in cue weighting adaptation. In
particular, do individuals vary in the amount of adaptation witnessed from the Canonical
to the Reversal block and, if so, are these differences stable across multiple testing
sessions? In addition to including the Schertz et al. (in press) "ba"/"pa" distinction,
which requires the “down-weighting” of a spectral cue (f0), a second task will be
included that requires the "down-weighting" of a temporal cue. Liu, Soh and Holt (2011)
asked listeners to identify spoken “set” and “sat” that varied in spectral quality (SQ- the
pattern of vowel formants) and in duration of the vowel. In this case, SQ is the dominant
cue (Hillenbrand, Getty, Clark & Wheeler, 1995) and the two cues co-vary such that
formant frequencies appropriate for the vowel /æ/ are associated with longer durations.
Liu et al. (2011) have demonstrated that listeners “down-weight” the use of duration
during a Reversal block where the correlations between the cues are reversed. The use of
the two tasks will provide an opportunity to see if there are general abilities to adapt cue
weights across tasks (a correlation in adaptation across tasks). On the other hand, it may
be the case that listeners differ in their ability to adapt spectral (f0) versus temporal cues.
If this is the case, then these two tasks may differentially predict performance on
degraded speech tasks depending on the type of degradation. Obleser, Eisner and Kotz
(2008) found that spectral versus temporal degradation of the speech signal resulted in
distinct lateralization patterns in a brain imaging study. Such differential patterns may
point to distinct processes involved in the comprehension of spectral versus temporal
degradation, which may have independent predictors (in addition to some common
predictors considering the correlation in performance across tasks obtained in Experiment
1).
3.2 Experiment 2 – Consonant and Vowel Adaptation Paradigms
Participants
The same group of 66 subjects from Experiment 1 participated in both the Vowel
("set"-"sat") adaptation and Consonant ("ba"-"pa") adaptation tasks. Adaptation sessions
always occurred after the two degraded speech sessions were completed. That is, the
subjects participated in four different sessions on four different days.
3.2.1 Methods for Vowel Adaptation Paradigm
Stimuli
Stimuli for this task were created by Liu and Holt (2015) from natural recordings
of the words “set” and “sat” by an adult native English female speaker. Stimuli were
created to vary along two dimensions: spectral quality (pattern of formant frequencies
ranging from canonical /ɛ/ to canonical /æ/) and vowel duration. Each dimension
consisted of a seven-step series, resulting in 49 total words. Spectral quality (SQ) of the
vowels was manipulated in Praat (Boersma & Weenink, 2009) to contain equal steps
between // and /æ/, with formant trajectory values interpolated in R (R Development
Core Team, 2008).Vowel duration was varied from 175 ms to 475 ms in 50 ms steps
using the PSOLA function in Praat. Stimuli in the Neutral block consisted of “set” and
“sat” tokens that spanned the entire covarying SQ-duration space. The purpose of the
Neutral block was strictly to orient the listener to the acoustic space; responses were not
included in any of the analyses.
Procedure
Participants were seated in front of a computer monitor in a sound-attenuated
booth and were given oral and written instructions that they would hear the English
words “set” or “sat” and that they should press the ‘z’ key on the keyboard if they hear
“set” or the ‘m’ key if they hear “sat”. The position of the visually displayed words
options on the screen matched the relative positions of the corresponding keys on the
keyboard ("set" word and 'z' key on the left; "sat" word and 'm' key
on the right). Participants were told that the experiment would be divided into three
blocks, but were not informed that there would be any difference between the blocks in
any way. All stimuli were presented via circumaural headphones (Sennheiser HD 280) at
approximately 75 dB SPL. A Neutral block consisting of “set” and “sat” tokens that
spanned the entire covarying SQ-duration space was presented first (see Figure 18a). The
purpose of the Neutral block was strictly to orient the listener to the acoustic space;
responses were not included in any of the analyses. The Neutral block consisted of 5
repetitions of the 25 stimuli marked by the gray box in Figure 18a for a total of 125 trials.
Following the Neutral block was the Canonical block, in which the stimuli maintained
the normal correlation between spectral quality and duration from natural productions
(formant values for /æ/ having a longer duration). The 18 stimuli marked as “Exposure
stimuli” in Figure 18b were presented 10 times each and maintained the natural
correlation between the two dimensions. Two test trials were also presented 10 times
each. These trials were equal and ambiguous on the SQ dimension but differed on the
Duration dimension (represented as diamonds in Figure 18b). These trials did not differ
in any way from the non-test trials other than the stimulus presented. There were a total
of 210 trials during this block. Following the Canonical block was a Reverse block in
which the natural correlation between SQ and Duration was reversed by playing only the
Exposure stimuli marked in gray in Figure 18c as well as the test stimuli. Note that
participants were not given any indication of the differences between the blocks nor were
there any differences between the last two blocks other than the Exposure stimuli
presented. The Vowel Adaptation paradigm was presented entirely over headphones using E-Prime software (Schneider et al., 2002). Subjects were not given feedback on any of the
trials in any block. Small breaks were allowed between blocks.
Figure 18. Schematic of Vowel adaptation task by block. Purple and orange
diamonds represent the test stimuli in each block [modeled after plot from Liu, Soh
& Holt (2011)].
3.2.2 Consonant Adaptation Paradigm
Similar to the vowel adaptation paradigm, this study also consists of varying
acoustic cues of familiar English sound contrasts, in this case, the voicing contrast
between /b/ and /p/. The paradigm is adapted from Idemaru & Holt (2011) and is similar
to that used by Schertz et al. (in press). In this case, the cue that should be "down-weighted"
in the reversal block is spectral (f0) as opposed to the temporal cue of vowel
duration in the vowel adaptation paradigm.
3.2.2.1 Methods for Consonant Adaptation Paradigm
Stimuli
A natural English voiceless stop recorded by a female native-English speaker was
manipulated on the dimensions of f0 and VOT (or aspiration duration). F0 values ranged
from 160 to 240 Hz, and VOT ranged from -20 to 50 ms. VOT was manipulated in Praat
using the Time-Domain Pitch-Synchronous Overlap-and-Add algorithm (TD-PSOLA;
Moulines & Charpentier, 1990). Negative VOT tokens (i.e., prevoiced tokens) were
created by adding pitch periods of voicing with no aspiration at formant transition onset.
F0 was also manipulated in Praat using the TD-PSOLA algorithm. The steady-state
portion of the vowel was set to the appropriate f0 level and then decreased linearly to
140 Hz for all stimuli. All stimuli were then embedded in the English carrier phrase "I
say __" recorded by the same female speaker. There were nine steps each for VOT
and f0, for a total of 81 distinct stimuli.
Test stimuli in both the Canonical and Reverse blocks consisted of two tokens
with an ambiguous VOT (15 msec) and f0 values of 170 Hz and 230 Hz. Similar to the vowel
adaptation task, the Neutral block for this task consisted of stimuli that spanned the entire
co-varying f0/VOT space, for the purpose of orienting the listener to the acoustic space
(see Figure 19a).
Figure 19. Schematic of Consonant adaptation task by block. Purple and orange
diamonds represent test stimuli in each block [modeled after plot from Liu, Soh &
Holt (2011)].
Procedure
Listeners were seated in front of a computer monitor and presented a 2-alternative
forced-choice task. The entire experiment consisted of three blocks: the Neutral block, the
Canonical block, and the Reverse block. For each block, participants were asked to push
buttons labeled “ba” and “pa” to categorize the syllable that was presented. The Neutral
block consisted of “ba” and “pa” tokens that spanned the entire covarying f0-VOT space
(see Figure 19a). As in the Vowel Adaptation paradigm, the purpose of the Neutral block
was strictly to orient the listener to the acoustic space, and again, participant responses for
this block were not included in any of the analyses. The Neutral block consisted of 2
repetitions of the 81 stimuli marked by the gray box in Figure 19a for a total of 162 trials.
Following the Neutral block was the Canonical block, in which the Exposure stimuli
maintain the normal correlation between f0 and VOT from natural productions (higher f0
values associated with longer VOT). The 10 stimuli marked as “Exposure stimuli” in
Figure 19b were presented 20 times each for a total of 280 trials. Two test trials were
also presented 20 times. These trials were equal and ambiguous on the VOT dimension
(15 ms) but differed on the f0 dimension (170 and 230 Hz; test stimuli are represented as
orange and purple diamonds in Figure 19b). These trials did not differ in any way from
the non-test trials other than the stimulus presented. Following the Canonical block was
a Reverse block in which the natural correlation between f0 and VOT was reversed by
playing only the Exposure stimuli marked in gray in Figure 19c as well as the test stimuli.
As in the Vowel Adaptation task, participants were not given any indication of the
differences between the blocks nor were there any differences between the last two
blocks other than the Exposure stimuli presented. The Consonant Adaptation paradigm
was presented entirely over headphones in PsychoPy software (Peirce, 2007). Subjects
were not given feedback on any of the trials in any block. Small breaks were allowed
between blocks.
Follow-up Session
All subjects were asked to return on a second day to repeat both the Vowel Adaptation
Paradigm and the Consonant Adaptation Paradigm for the purpose of determining whether
listeners' ability to perceptually adapt to varying acoustic cues is reliable. The task order
was counterbalanced across sessions to control for any order effects. Participants received
identical instructions for both tasks as they did in Session 1.
Scoring Procedure
For the purpose of this study only responses from Blocks 2 and 3 (Canonical and
Reverse, respectively) were used for analysis. Again, the purpose of the Neutral block
was only to orient the listener to the acoustic space. “Cue Weighting” for the secondary
cue in each task (Duration and f0 for Vowel and Consonant adaptation, respectively) was
operationalized as the difference in categorization responses for the two test stimuli in a
block. So, the cue weight for f0 in the Canonical Consonant block would be the
difference in % /p/ responses for the high f0 test stimulus minus the low f0 test stimulus.
A large difference in these scores was taken as an indication that the listener was using
this cue for consonant categorization whereas a small difference would suggest that there
was little weight given to variability on this dimension. This operational measure of cue
weighting is conceptually related to the measures used by Holt and Lotto (2006) in their
model of cue weighting. The measure of import for the current project is the adaptation
of these cue weights – the change in the weight as one moves from Canonical to Reverse
blocks. Thus, the operational measure of “adaptation” was the difference between
difference scores – the difference in responses to test stimuli in the Canonical block
minus the difference in responses to test stimuli in the Reverse block. To be concrete,
adaptation was measured in the Vowel Adaptation task as the ((% “set” responses for
long duration test stimulus in Canonical block - % “set” responses for short duration test
stimulus in Canonical block) - (% “set” responses for long duration test stimulus in
Reverse block - % “set” responses for short duration test stimulus in Reverse block)).
Adaptation was measured in the Consonant Adaptation task as the ((% "pa" responses for
high f0 test stimulus in Canonical block - % "pa" responses for low f0 test stimulus in
Canonical block) - (% "pa" responses for high f0 test stimulus in Reverse block - % "pa"
responses for low f0 test stimulus in Reverse block)).
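Expressed as code, the adaptation measure is simply a difference of difference scores; the sketch below takes the four per-block percentages as inputs (the example numbers are hypothetical).

```python
def cue_weight(pct_high_cue, pct_low_cue):
    """Within-block secondary-cue weight: difference in % responses between the two
    test stimuli (e.g., % 'pa' for the high f0 minus % 'pa' for the low f0 test stimulus)."""
    return pct_high_cue - pct_low_cue

def adaptation_score(canonical_high, canonical_low, reverse_high, reverse_low):
    """Adaptation = Canonical-block cue weight minus Reverse-block cue weight."""
    return cue_weight(canonical_high, canonical_low) - cue_weight(reverse_high, reverse_low)

# Example: a listener giving 85% vs 25% 'pa' responses in the Canonical block
# and 55% vs 50% in the Reverse block has an adaptation score of 55.
print(adaptation_score(85, 25, 55, 50))   # -> 55
```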
3.3 Results
The goal of Experiment 2 was to test the reliability of NH listener performance on
cue weighting adaptation tasks for the purpose of using these measures as possible
predictors of performance on degraded speech tasks (discussed further in Chapter 4). In
order to assess reliability of listener performance on cue weighting adaptation, listeners
participated in two separate adaptation tasks and were retested on a following day.
Replication of Adaptation Effects
Although it was not a main goal of this study, the data do provide an opportunity
to test the replicability of the original effects of Idemaru and Holt (2011) and Liu, Soh
and Holt (2011) with larger sample sizes. Both studies found that there was a significant
decrease in the measure of cue weighting (difference scores between the test stimuli) for
the Reverse block versus the Canonical block. These effects can be tested here with a
within-subjects 2x2 (Test Stimulus x Block) ANOVA, where a significant interaction
would be indicative of cue weight adaptation. For the Vowel Adaptation task, the
predicted interaction was significant for session 1 (F(1,65) = 237.66, p<.001; See Figure
20) and for session 2 (F(1,65) = 225.26, p<.001; See Figure 21). For the Consonant
Adaptation task, the predicted interaction was significant for session 1 (F(1,65) = 141.38,
p<.001; See Figure 22) and for session 2 (F(1,65) = 172.37, p<.001; See Figure 23).
These successful replications provide more hope that these tasks can serve as reliable
metrics of cue weighting adaptation.
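The 2 x 2 within-subjects ANOVA can be run with, for example, statsmodels' AnovaRM; the sketch assumes a long-format table containing one mean value per subject, test stimulus, and block, with hypothetical file and column names.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# One row per subject x test stimulus x block, with the mean % response in 'pct'.
long_df = pd.read_csv("vowel_adaptation_session1.csv")   # hypothetical file

aov = AnovaRM(data=long_df, depvar="pct", subject="subject",
              within=["test_stimulus", "block"]).fit()
# The Test Stimulus x Block interaction indexes cue-weight adaptation.
print(aov.anova_table)
```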
Figure 20. Average percent "set" responses for test stimuli in Canonical
and Reverse blocks in Session 1
Figure 21. Average percent "set" responses for test stimuli in Canonical
and Reverse blocks in Session 2
Figure 22. Average percent "pa" responses for test stimuli in Canonical
and Reverse blocks in Session 1
Figure 23. Average percent "pa" responses for test stimuli in Canonical
and Reverse Blocks in Session 2.
Reliability across sessions
Table 5 provides means and standard deviations for the adaptation measure
described above for both tasks on both sessions. As can be seen from the large standard
deviation scores, there was substantial individual variability in performance on the task,
which is essential for it to serve as a measure of individual differences. A second
important characteristic for this measure to be successful is that these scores should be
correlated across sessions, indicating a stable individual difference as opposed to
measurement error. Pairwise correlations were computed for Consonant and Vowel
adaptation scores on session 1 and session 2. Comparisons reveal significant correlations
at the p <.01 level between Consonant adaptation scores on session 1 and session 2 (r
=.398), as well as significant correlations at the p <.05 level between Vowel adaptation
scores on session 1 and session 2 (r = .261). A linear regression model was computed for
each adaptation task across both sessions, using days in between sessions as a covariate in
order to account for individual differences in the delay between sessions. Vowel
adaptation scores on day 1 were predictive of vowel adaptation scores on day 2, β = .576,
t(64) = 4.88, p < .001. Intraclass Correlation Coefficient (ICC) for Vowel adaptation
scores across days was .414 with a 95% confidence interval from .044 to .641, F(65) =
1.708, p<.05. Consonant adaptation scores on day 1 also significantly predicted
Consonant adaptation scores on day 2, β = .328, t(64) = 3.83, p < .001. ICC for
Consonant adaptation scores across days was .570 with a 95% confidence interval from
.297 to .737, F(65) = 2.324, p<.001. Scatterplots of individual scores for each task over
both sessions are depicted for the Vowel Adaptation task (Figure 24), Consonant
Adaptation task (Figure 25) and average scores for each task (Figure 26).
Table 5. Overall means and standard deviations for each adaptation task for
Sessions 1 and 2 (indicated by day).

Adaptation Task                Mean Adaptation Score    Std. Deviation
Consonant Adaptation Day 1     35.5                     24.2
Consonant Adaptation Day 2     38.9                     24.0
Vowel Adaptation Day 1         61.8                     32.5
Vowel Adaptation Day 2         62.4                     33.8
Figure 24. Scatterplot of individual scores on the Vowel
Adaptation task for Session 1 (VowAdapt1) plotted against
Session 2 (VowAdapt2).
Figure 25. Scatterplot of individual scores on the Consonant
Adaptation task for Session 1 (ConsAdapt1) plotted against Session 2
(ConsAdapt2).
Figure 26. Scatterplot of individual scores averaged across both
sessions for the Consonant Adaptation tasks (ConsAdaptAVG)
plotted against average scores for the Vowel Adaptation task
(VowAdaptAVG).
Reliability across tasks
To test whether a general perceptual flexibility may underlie both tasks, pairwise
correlations between adaptation scores on each task were computed. As can be seen in
Table 6, there were no significant correlations across tasks at either session. Adaptation
scores were also averaged across days for each task and correlations were not significant
between averaged Consonant and Vowel adaptation scores (r = .147).
Table 6. Correlational analyses within and across adaptation tasks for Session 1
and Session 2.

Adaptation Score Comparisons            Correlations
Consonant & Vowel Day 1                 r = .18
Consonant & Vowel Day 2                 r = .20
Consonant Day 1 & Vowel Day 2           r = -.03
Consonant Day 2 & Vowel Day 1           r = .03
Consonant Day 1 & Consonant Day 2       r = .39**
Vowel Day 1 & Vowel Day 2               r = .26*

**Significant at the 0.01 level. *Significant at the 0.05 level.
3.4 Interim summary and discussion
One of the predictions of this current work is that in order to perform optimally
when listening to degraded speech, a listener must be able to adapt their perception. That
is, the listener will have to rely on secondary cues or a re-weighting of their cue use in
order to match the reliability of the cues available in the degraded signal. In order to test
this prediction, we first had to ascertain whether individual adaptation of cue weighting
abilities are reliable, i.e. do listeners demonstrate the same ability to accommodate their
cue weights over time and across multiple cue weighting tasks? The idea is that if an
individual’s ability to adapt to varying acoustic cues is consistent, then perhaps this
ability can be used to predict performance on a degraded speech task.
For this study, participants were presented two different cue adaptation tasks for
speech sound categories that contained a dominant cue in the spectral domain (spectral
quality in the Vowel adaptation task) and the temporal domain (VOT in the Consonant
adaptation task). Listeners were tasked with categorizing English vowels /ɛ/ and /æ/ in
the Vowel adaptation task, and English stop consonants ‘p’ and ‘b’ in the Consonant
adaptation task. In each block of the experiment, listeners heard exposure stimuli that
either reflected a canonical or reversed relationship between the cues. Adaptation was
measured by the difference in response (percent “set” responses or percent “pa”
responses in the vowel and consonant tasks, respectively) when the primary cue for
categorization was ambiguous in both canonical and reverse blocks. Each individual’s
adaptation score was compared across both tasks and over two separate sessions.
On the positive side, the adaptation tasks did replicate the earlier findings of
Idemaru and Holt (2011) and Liu, Soh and Holt (2011), and the individual variability in
adaptation scores suggested that there are individual differences in performance on the
task. Results from Experiment 2 suggest that individual variability in cue weighting
adaptation is indeed consistent over time. Adaptation scores in both tasks were
significantly correlated with scores on the second session when participants repeated the
same tasks. However, these correlations were moderate in size (β values of .576 and
.328, when one controls for days between sessions).
Adaptation scores did not, on the other hand, correlate across tasks. Given that the
dominant cue dimensions for categorization were in separate domains (temporal/spectral)
across tasks, it would seem reasonable that scores on each task were not perfectly
correlated. The complete lack of correlation, however, is somewhat surprising as the
general perceptual task of adapting one's cue weights for optimal speech categorization is
the same. Whereas these measures do not appear to provide a generalizable measure of
perceptual flexibility, it is possible that each of them will provide predictive power for
individual variability in specific degradations that present a similar listening challenge.
For example, the Consonant adaptation task, which requires the “down-weighting” of a
spectral cue (f0), may be predictive of performance on the noise-vocoding task in which
spectral cues and particularly f0 are degraded. Experiment 3 tests whether performance on
degraded speech tasks can be predicted from these measures of cue weighting adaptation.
CHAPTER 4
Predicting individual differences in degraded speech perception
4.1 Introduction
There have been several attempts to account for the variability in degraded speech
perception with measures of basic auditory abilities (e.g., Espinoza-Varas and Watson,
1988; Surprenant and Watson, 2001). These measures included a wide range of auditory
abilities, from pitch and amplitude discrimination of tones, to syllable identification of
nonwords. Attempts to correlate these measures with performance on degraded speech
tasks have been largely unsuccessful. This could be due to two possibilities. The first
possibility is that the variability in performance seen in degraded speech tasks is not a
result of true, stable variability, and instead due to measurement error. The second is that
the general auditory and cognitive measures used thus far have been too coarse to
measure the differences in the specific processing required to accurately perceive
degraded speech. Results from Experiment 1 of the current work demonstrate that the
variability in performance on a variety of degraded speech tasks is indeed reliable and not
due to measurement error. This leaves open the possibility that a more specific test of
auditory and/or cognitive function could better assess performance on a degraded speech
task. Working from this hypothesis, Experiment 2 was designed to specifically test a
listener’s ability to adapt their cue weighting strategy on the basis of changing acoustic
cues in both the spectral (Consonant Adaptation) and temporal (Vowel Adaptation)
domain. Results from Experiment 2 suggest that adaptation scores (within tasks) are
indeed reliable - scores across two separate sessions were highly correlated with one
another. Scores across tasks, however, were not significantly correlated with each other.
While this does suggest a lack of generalizable measures of perceptual flexibility from
these tasks, there is still the possibility that the tasks will independently capture aspects of
perception that relate to specific listening challenges of the different degradation types.
Experiment 3 seeks to determine the extent to which individual differences in degraded
speech perception can be explained by these measures of listener cue weighting
flexibility.
Recent attempts have been made to predict individual differences in adaptation to
degraded speech using auditory skills and brain morphology. Erb et al. (2012)
hypothesized that individual differences in adaptation to noise vocoded speech,
specifically, could potentially be predicted by non-speech auditory, cognitive and
neuroanatomical factors. They presented 18 NH listeners with 100 noise-vocoded
sentences (4-channel vocoding) while they underwent functional magnetic resonance
imaging (fMRI) scans. The sentences consisted of a low-predictability German version of
the English Speech Perception in Noise (SPIN) sentences (Kalikow, Stevens & Elliott, 1977).
Subjects were also given two working memory tasks (i.e. digit span and nonword
repetition), and an amplitude modulation (AM) rate discrimination task (centered around
4Hz). Results revealed considerable variability in perceptual learning of noise vocoded
speech and ultimately, that AM rate discrimination thresholds were the most predictive of
individual adaptation to noise-vocoded speech. The neuroanatomical factor most
predictive of noise-vocoded speech learning was an increased grey matter volume (GMV) in the
pulvinar; poorer learners, on the other hand, had a larger GMV in the right-lateralized
middle frontal gyrus and inferior temporal gyrus. No correlations were observed between
working memory tasks (both nonword repetition and digit span) and noise-vocoded
speech learning (Erb et al., 2012).
There are several important observations to note from this study as they relate to
the current work. First, it is important to note that the purpose of the study by Erb and
colleagues (2012) was to examine individual differences in the perceptual learning of
noise-vocoded speech. This is different from the current work which focuses instead on
the variability in the immediate ability to perceive degraded speech. Second, it is difficult
to make claims or generalizations about individual differences from results of only 18
participant responses. In fact, the correlation between AM rate discrimination and noisevocoded speech learning is clearly dependent upon scores by two single subjects which,
if removed, render the correlation insignificant. This is not to say that AM rate
discrimination could not be a predictive factor in one’s ability to perceive degraded
speech, but certainly it calls for further examination with a larger subject population.
Recall the study by Choe and colleagues (2009) described in Chapter 1 that
examined the perceptual outcome of NH listeners when presented with dysarthric speech
(a result of a neuromuscular disorder that affects the production of speech). They
observed large individual differences in NH listener performance on transcribing
dysarthric speech. In attempts to uncover the source of this variability, the authors
presented the same listeners with a set of cognitive-perceptual tasks that are presumed
fundamental to adaptive perception of speech. These included classic speech context
effects (e.g. /b/-/w/ categorization shifts in response to following vowel duration), and a
working memory span task (Engle, Carullo and Collins, 1991). Despite the intuitive
predictions, the size of the shift in classic speech context effects did not significantly
predict performance on the dysarthric transcription tasks, and working memory span
correlated only weakly
(r = 0.267).
4.1.2 Working memory capacity
The relationship between working memory and one’s ability to understand speech
in adverse listening conditions has been increasingly examined over the years. The rise in
interest in this relationship is likely due both to the increasingly observed variability in
degraded speech perception amongst normal-hearing listeners and to the presumption
that this variability may in fact be a result of individual differences in cognitive abilities.
Several recent studies have observed positive correlations between phonological
working memory (WM) and degraded speech learning (Eisner et al., 2010) and successful
speech perception post-implantation for children with cochlear implants (CIs) (Pisoni & Geers, 2000; Cleary
et al., 2001; Pisoni & Cleary, 2003). A potential problem with these studies, as they relate
to the current study, is the presentation of the degraded speech tasks. In the case of Eisner
et al. (2010), subjects’ degraded speech performance was quantified by a task that
required them to maintain entire sentences in short-term memory before receiving
feedback. During training, subjects first listened to a vocoded sentence and then saw a
written version of the sentence on the screen. Test trials consisted of
sentences that they were instructed to listen to first and then repeat back as much as they
could. It is not surprising, then, that measures of WM would correlate strongly with
performance on this task, when it is arguably a task of WM in itself.
Another example comes from a recent study by Adank and Janse (2010),
examining differences in perceptual adaptation to accented speech by young and old
adults. Twenty younger adults (mean age = 23.3 years) and thirty older adults (mean age
= 74.1 years) were presented with a novel accent created by a native Dutch speaker
pronouncing sentences with altered vowels. The sentences were embedded in speech-shaped
noise, and the listeners’ adaptation to the accent was measured by the change in speech
reception thresholds (SRT) over four presentation blocks. Additionally, the older
participants were given two cognitive measures – the Digit-Symbol Substitution
(DSS, part of the Wechsler Adult Intelligence Scale, 1999) and the Trail Making Test
(TMT; Reitan, 1958). The first was intended as an index of information-processing
speed, and the second as a measure of executive functioning or task switching. As one
would expect, the older listeners had more difficulty understanding the novel accent than
the younger listeners, but there was no difference between groups in rate and magnitude
of adaptation. While the authors did not find associations between adaptation to accented
speech and the DSS (information processing speed), there was a weak interaction
observed between the TMT (measure of executive function) and the different conditions
used to examine adaptation to accented speech in a mixed-effects model. It is important
to note, however, that the TMT did not predict overall performance in the adaptation to a
novel accent.
In a similar study, Banks et al. (2015) tested the effects of several measures of
cognitive ability (including WM) on perceptual adaptation to accented speech. One
hundred listeners were presented with accented speech manipulated similarly to the accented
speech from Adank and Janse (2010), and SRTs were used to measure adaptation across
multiple presentation blocks. They were also given a series of cognitive measures
including vocabulary knowledge, using the Wechsler Abbreviated Scale of Intelligence
(WASI; Wechsler, 1999), a standard Stroop test (Stroop, 1935), and a standard reading
span as a test of WM (Ronnberg et al., 1989). The investigators observed that while
vocabulary scores did significantly predict recognition accuracy for accented speech,
WM scores did not.
Despite the lack of evidence for a relationship between WM capacity and degraded speech
performance, it is still generally assumed that WM is the cognitive system for
maintenance and manipulation of active/online information. Several researchers have
demonstrated that WM even relates to language processes, including language learning
(Baddeley, Gathercole & Papagno, 1998) and comprehension (Daneman & Merikle,
1996). The general consensus is that successful perception of speech (phrases and
sentences) depends on higher-order cognitive processes. However, the hypothesis of the
current work is that one’s ability to perceive degraded speech is mediated by perceptual
adaptability, independent of working memory. Therefore, it is prudent to include a
measure of WM in the current study to rule out the possibility that the consistent
variability across listener performance is a direct result of variability in WM capacity.
For this study, a shortened version of the Operation Span Task (Foster et al.,
2014) was used (see Chapter 2 for a detailed description of the task), which correlates
with other measures of WM and serves as a valid measure of working memory capacity
(WMC) in a significantly shorter amount of time (approximately 20 minutes) than
typical measures of WMC. The task is completely automated and runs in E-Prime
software. Subjects were seated in front of a computer monitor and asked to recall
sequences of orthographic letters while solving basic math problems. WMC scores were
quantified by a participant’s ‘Partial Span Score’ which is equal to the total number of
items recalled in the correct order on memory trials.
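To make the scoring rule concrete, a minimal sketch in R is given below. It is an illustration only, not the actual E-Prime scoring routine, and the presented and recalled letter sequences are hypothetical placeholders.

    # Partial span credit: each letter recalled in its correct serial position
    # earns one point; points are summed over all memory trials.
    partial_span <- function(presented, recalled) {
      sum(mapply(function(p, r) {
        n <- min(length(p), length(r))
        sum(p[seq_len(n)] == r[seq_len(n)])   # positional matches only
      }, presented, recalled))
    }

    # Example: 3 correct positions on the first trial, 2 on the second = 5 points
    partial_span(list(c("F", "K", "P"), c("Q", "J", "L", "R")),
                 list(c("F", "K", "P"), c("Q", "J", "R", "L")))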
4.2 Experiment 3 – Predicting degraded speech performance
4.2.1 Methods
Data for the current analyses consist of the 66 participants’ scores on the three
degraded speech tasks (noise-vocoded speech, compressed speech and speech in babble
noise), two adaptation tasks (Vowel Adaptation and Consonant Adaptation) and one
working memory task (OSPAN). All 66 subjects participated in these tasks, repeating the
degraded speech and adaptation tasks on subsequent days. The number of days in
between tasks varied, but was no less than 2 days (see Appendix B for the breakdown of
days in between sessions for each individual subject). Each subject performed all of the
tasks as follows:
Day 1 – Degraded speech tasks Session 1 and WM task
Day 2 – Consonant and Vowel Adaptation tasks Session 1
Day 3 – Degraded speech tasks Session 2
Day 4 – Consonant and Vowel Adaptation tasks Session 2
Analyses
Analyses for the current study utilized scores on all tasks, averaged across sessions. Pairwise correlations were computed for average percent correct scores on each degraded
speech task, with averaged adaptation scores for the Vowel and Consonant Adaptation
tasks. The working memory score (i.e. OSPAN partial score) was also included.
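As an illustration of this analysis, a minimal sketch in R follows. The data frame scores and its column names are hypothetical placeholders (one row per subject, each column a task score already averaged across sessions); the actual analysis scripts are not reproduced here.

    # Hypothetical per-subject data frame; runif()/sample() stand in for real scores
    set.seed(1)
    scores <- data.frame(
      babble          = runif(66, 20, 90),   # % correct, speech in babble noise
      compressed      = runif(66, 20, 90),   # % correct, compressed speech
      nv              = runif(66, 20, 90),   # % correct, noise-vocoded speech
      vowel_adapt     = runif(66,  0, 60),   # Vowel Adaptation score
      consonant_adapt = runif(66,  0, 60),   # Consonant Adaptation score
      ospan           = sample(0:75, 66, replace = TRUE)  # OSPAN partial score
    )

    # Pairwise Pearson correlations among all measures (cf. Table 7)
    round(cor(scores), 2)

    # A single pairwise test, e.g. Vowel Adaptation vs. speech in babble noise
    cor.test(scores$vowel_adapt, scores$babble)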
4.2.2 Results
Working Memory
OSPAN partial scores did not correlate significantly with performance on any of the individual
degraded speech tasks or adaptation tasks. Table 7 provides a list of all task comparisons
and corresponding r values.
Vowel Adaptation Paradigm
Average Vowel Adaptation scores (averaged across both sessions) did not correlate
significantly with any of the individual degraded speech tasks, nor did they correlate with
the working memory (i.e. OSPAN) score (see Table 7).
Consonant Adaptation Paradigm
Average Consonant Adaptation scores (averaged across both sessions) also did not
correlate significantly with any of the individual degraded speech tasks, nor did they
correlate with the working memory (i.e. OSPAN) score (see Table 7).
All Variables
A linear regression model was calculated predicting the average degraded speech
performance as a function of all three predictor variables (OSPAN, Vowel and Consonant
adaptation). This model resulted in an r-square of .018, which was not significantly
different from zero (p = .77). Figures 27-31 show scatterplot data for scores on each
adaptation task and degraded speech task, separately; there is a clear lack of correlation
and no evidence of a polynomial function that would fit the data.
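A minimal R sketch of this regression is shown below, reusing the hypothetical scores data frame from the earlier sketch. The reported values (r-square = .018, p = .77) come from the actual data, not from these placeholders.

    # Average degraded speech performance across the three tasks
    scores$degraded_avg <- rowMeans(scores[, c("babble", "compressed", "nv")])

    # Regression with all three predictor variables entered
    fit <- lm(degraded_avg ~ ospan + vowel_adapt + consonant_adapt, data = scores)
    summary(fit)   # reported model: R-squared = .018, overall p = .77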
Table 7. Overall correlations between average performance on each adaptation task,
degraded speech task and working memory task

Task comparison                                  Correlation
Consonant AVG & Babble AVG                       r = -.09
Consonant AVG & Compressed AVG                   r = -.09
Consonant AVG & NV AVG                           r = .10
Consonant AVG & Total Degraded Speech AVG        r = -.02
Vowel AVG & Babble AVG                           r = -.21
Vowel AVG & Compressed AVG                       r = .01
Vowel AVG & NV AVG                               r = -.10
Vowel AVG & Total Degraded Speech AVG            r = -.10
OSPAN & Babble AVG                               r = -.10
OSPAN & Compressed AVG                           r = .07
OSPAN & NV AVG                                   r = .05
OSPAN & Total Degraded Speech AVG                r = .09
OSPAN & Consonant AVG                            r = .01
OSPAN & Vowel AVG                                r = -.14
Figure 27. Individual data for Vowel Adaptation score and percent correct word recognition on the Speech in Babble Noise task.

Figure 28. Individual data for Vowel Adaptation score and percent correct word recognition on the Compressed Speech task.

Figure 29. Individual data for Vowel Adaptation score and percent correct word recognition on the Noise-Vocoded Speech task.

Figure 30. Individual data for Consonant Adaptation score and percent correct word recognition on the Compressed Speech task.

Figure 31. Individual data for Consonant Adaptation score and percent correct word recognition on the Noise-Vocoded Speech task.
4.3 Interim summary and discussion
The third aim of the current work was to determine the extent to which individual
differences in degraded speech perception could be predicted or explained by measures of
cue weighting flexibility. There is a long history of unsuccessful attempts to predict
adaptation to degraded speech. Initial attempts consisted largely of measures of general
auditory abilities (Watson, 1991; Surprenant and Watson, 2001), under the presumption
that adaptation to an adverse listening condition requires that a listener become sensitive
to changes in the auditory signal. Working memory capacity has also been commonly
presumed a factor in perception of degraded speech (Choe et al, 2009; Adank and Janse,
2010; Erb et al., 2012, among others). As an ability that involves the retention of
incoming information and simultaneous attention to separate tasks, working memory
would seem like an intuitive predictor of degraded speech perception. Unfortunately,
attempts to correlate working memory scores with adaptation to degraded speech have
been weak at best. The lack of predictive power from general auditory and working
memory abilities has led to the investigation of more speech-specific measures of
adaptability. Work by Choe and colleagues (2009), for example, examined whether the
size of the shifts obtained from classic speech context effects would be more predictive of
degraded speech perception. The idea was that these effects are more fundamental to
adaptive speech perception and therefore more likely to determine whether a listener will
fare better (or not) in a challenging listening environment. Ultimately, their measures of
speech context effects were not predictive of degraded speech perception.
The goal of the current work was to explore another possible predictor of
degraded speech performance, building on a more specific measure of adaptive
perception to speech. Experiment 3 served as an analysis of the relationship between a
listener’s cue weighting flexibility, working memory capacity and degraded speech
perception scores. Degraded speech data consisted of percent correct scores for each
degraded speech task (noise-vocoded, compressed and in babble noise), averaged across
sessions. Cue weighting adaptation data included adaptation scores on each task (Vowel
and Consonant Adaptation), also averaged across both sessions. Results from
Experiments 1 and 2 showed that performance on all tasks (degraded speech and
adaptation tasks) is consistent over time (i.e. over two separate sessions). Therefore, scores on
all tasks were averaged across both sessions to capture overall performance. Working
memory scores were also included in the analysis.
Working memory capacity was not predictive and, contrary to the predictions
of Experiment 3, neither of the adaptation tasks was successful at predicting performance
on the degraded speech tasks. Despite the lack of generalizability across adaptation tasks
(as determined from Experiment 2), it was expected that they might independently
provide predictive power related to degradation types that present similar listening
challenges. Under this presumption, one would expect that the Consonant Adaptation task
would be predictive of performance on the noise-vocoded speech task since it requires
the “down-weighting” of the spectral cue (f0), which is heavily degraded in noise-vocoded
speech. However, this presumption did not hold, and neither task was predictive
of degraded speech performance. A regression model including all of the predictor
variables also failed to be significant.
CHAPTER 5
Discussion
5.1 Summary of main findings
The experiments discussed in the current work investigate individual differences
in normal hearing (NH) listeners’ perception of degraded speech. Two main questions are
addressed in this work. First, does the performance variability for NH listeners on
degraded speech tasks reflect a stable individual difference? Second, to what extent can
this variability be predicted by a measure of perceptual flexibility?
Chapter 2 provided a comparison of performance across different types of
degraded speech tasks. The goal was to determine whether individual differences in
degraded speech perception are stable, true differences. This was achieved by presenting
NH listeners with 3 different types of speech degradation, each presenting different
challenges to the listener: (1) noise-vocoded speech – with a degraded spectral
representation of the speech signal; (2) time-compressed speech – with altered temporal
representation; (3) speech in babble noise – which compromises both the temporal
envelope and spectral representation unless one can segregate the speech and noise. Each
degradation type was presented as a word recognition task. All of the participants
returned for a second session to repeat the word recognition task for the same three
degradation types. The main requirement for variability in performance on degraded speech
tasks to be viewed as a true individual difference, as opposed to measurement error, is a
correlation in performance across sessions. For each of the three tasks, there was a
significant correlation between performance on session 1 and session 2, with days
between sessions entered as a covariate. The β values for the compressed and noise
vocoded tasks indicated a strong agreement in performance across the two sessions (0.74
and 0.79, respectively). The correlation across sessions was much more modest for the
speech in babble noise task (β = 0.37). This difference may be surprising because the
speech in noise task is presumably the closest to real-world listening challenges that the
NH listener is likely to be familiar with. If there are differences in strategy or
performance, one may presume that they would be most stable across sessions for this
commonly encountered task as one would assume that the average listener is much more
“practiced” with this form of degradation compared to noise-vocoded or compressed
speech. One possibility is that familiarity and practice with this task produce similar
performance across listeners, lowering the variability across both sessions, which could
in turn lead to a greater influence of measurement noise. Whereas the
speech in noise task showed the least amount of variability of the tasks, it was not
substantially less variable (average standard deviation = 8.1% for speech in noise; 9.3%
for compressed speech; 8.9% for noise vocoded speech). It may be pertinent for future
studies to examine whether individual differences are more stable for less familiar tasks.
However, for the current study, it is clear that there is substantial agreement in
performance across sessions as indicated by the β = 0.80 for predicting performance
across sessions when percent correct is averaged across the three tasks.
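For illustration, a minimal R sketch of such a cross-session consistency analysis is given below. The data frame sessions is a hypothetical placeholder, and treating the reported β values as standardized coefficients from a model with days between sessions entered as a covariate is an assumption based on the description above.

    # Hypothetical per-subject data (placeholders, not the real scores)
    set.seed(2)
    sessions <- data.frame(
      session1     = runif(66, 30, 90),                # % correct, session 1 (task average)
      days_between = sample(2:32, 66, replace = TRUE)  # days separating the two sessions
    )
    sessions$session2 <- sessions$session1 + rnorm(66, 0, 5)  # correlated session 2 scores

    # Standardizing the variables makes the session1 coefficient a beta weight
    fit <- lm(scale(session2) ~ scale(session1) + scale(days_between),
              data = sessions)
    summary(fit)   # for the real data, the session1 beta was 0.80 (task average)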
The second question about degraded speech performance was whether there was a
correlation in performance across the tasks. This could indicate a general ability
underlying some of the performance variability across multiple types of listening
challenges. The analysis indicates that there was substantial agreement in performance
across tasks (averaged across the two sessions). The three tasks were all significantly
pairwise correlated with one another (r = .53-.59), and regression models that predicted
performance on one task from performance on the other two also showed a strong
relationship (β = 0.63-0.67). These results provide hope that one could find general
measures of perceptual/cognitive ability that may predict degraded speech performance
across a variety of different forms of degradation and resultant listening challenges. One
possibility for such a general ability would be the flexibility in “weighting” different
acoustic cues based on their reliability in the degraded signal.
The goal of Chapter 3 was to examine whether two specific cue weighting
adaptation tasks – Vowel Adaptation and Consonant Adaptation – may be suitable
measures of perceptual flexibility related to degraded speech perception. The rationale for
using these tasks was that they each provided a measure of adaptability to changing
acoustic information in the speech signal – a task similar to degraded speech perception.
Each task capitalized on two specific cues to speech categorization. For the Vowel
Adaptation task, these were spectral quality (primary cue) and duration, for the vowels /ɛ/
and /æ/. Cues for the Consonant Adaptation task consisted of voice-onset time (primary
cue) and fundamental frequency, for the stop consonants /b/ and /p/. In each case, the task
for the listener was to “down-weight” or rely less on the secondary cue for categorization
as it became less informative during the task. In order to serve as a potential predictor of
degraded speech perception performance, these tasks needed to show sufficient inter-subject
variability and reliability across multiple sessions. The results of the experiment
in Chapter 3 demonstrated sufficient individual variability in the amount of “down-weighting”
between a canonical cue relationship and a reversed cue relationship.
Furthermore, there was a significant correlation in this measure across two sessions, with
days between sessions entered as a covariate. The magnitude of this relationship,
however, was moderate especially for the Consonant Adaptation task (β = 0.33; β = 0.58
for Vowel Adaptation task).
Surprisingly, there was no significant correlation found for the “down-weighting”
measure across tasks. The lack of a correlation provides little evidence for a general
perceptual flexibility ability underlying both tasks. However, there was some hope that
the two tasks may provide independent predictions of degraded speech performance,
since one of the tasks requires down-weighting of a spectral cue (Consonant Adaptation)
and the other requires down-weighting of a temporal cue (Vowel Adaptation).
Chapter 4 examined the extent to which the measures of cue weighting
adaptability from Chapter 3 were predictive of degraded speech performance. This
chapter details the comparisons between performance scores on the degraded speech
perception tasks and cue-weighting adaptation tasks. A working memory span task was
also included in the analysis, as it remains a commonly presumed factor affecting
adaptation to degraded speech. As was the case for previous work on attempting to
predict NH degraded speech performance (e.g., Surprenant & Watson, 2001; Choe et al.,
2009), the current study failed to find measures that could predict a substantial amount
of the individual variance in performance. In fact, none of the predictor variables was
significantly related to performance on any of the degraded speech tasks nor on
performance averaged across tasks. Moreover, a regression model with all predictor
variables entered (Vowel and Consonant Adaptation, and working memory capacity) failed
to significantly predict averaged speech performance (r² = 0.018). The lack of
predictive power for the working memory measure may be surprising given the widespread belief that working memory capacity is essential for the complex processing
involved in degraded speech comprehension. However, this result agrees with previous
work that has failed to demonstrate a strong relationship between measures of working
memory and degraded speech performance (e.g., Choe et al., 2009; Adank & Janse, 2010;
Banks et al. 2015).
Despite the failure of previous attempts to relate general psychoacoustic abilities
to speech perception performance (Watson, 1991; Surprenant & Watson, 2001; Choe et al.,
2009), there was reason to believe that the current adaptation measures would have
demonstrated some predictive power in the degraded speech tasks because they are closer
to processes presumed to play a role in successful perception of degraded speech. When
information in the signal is degraded, listeners must presumably exploit the redundant nature
of the speech signal, relying on the more reliable portions of the signal to successfully
comprehend the speech. For example, the spectral degradation that occurs because of
noise-vocoding may be accommodated by down-weighting spectral information and up-weighting
the relatively preserved temporal information in the signal. The Adaptation
tasks used in this study changed the correlations between cues and resulted in the
down-weighting of previously useful cues. It was thought that individual differences in this
down-weighting measure would be closely related to the task demands of degraded
speech perception. However, the lack of correlations in performance suggests that the
processes involved in these tasks may not overlap with sources of variability in degraded
speech performance.
This failure of a measure from a well-established empirical task of phonetic-level
adaptation to predict degraded speech perception is similar to the failure of Choe et al.
(2009) to predict perceptual performance on dysarthric speech using measures of
phonetic context effects. One possibility is that the measures used from these tasks
(differences in responses to test stimuli between Canonical and Reversed blocks in the
current project and differences between responses to test stimuli as a function of the
context in Choe et al., 2009) are not the best variables for examining the underlying
adaptive process. It may be informative to examine differences from a baseline measure,
reaction times, or some other measure of performance. However, another possibility is
that the phenomena of adaptation at the phonetic segment level (whether context effects,
talker normalization, or cue weighting adaptation) are not relevant to perception at the
word or sentence level. Holt and Lotto (2010; see also Lotto & Holt, 2000) have
questioned whether phonemic categorization tasks are relevant to speech perception “in
the wild” and presented some arguments based on clinical double-dissociations between
word recognition tasks (Miceli, Gainotti, Caltagirone, & Masullo, 1980) and phoneme
categorization/discrimination tasks (Blumstein, 1995) that the two types of tasks are not
completely overlapping. Faber (1992) and Port (2007; 2010) have also proposed that
phonetic segments are not normally part of the perception of spoken language. The complete
lack of correlation between performance on perceptual adaptation tasks at the phonetic level
and performance at the level of words and sentences (Choe et al., 2009; and the current study) could
provide additional evidence for the non-overlap of processes involved in these tasks.
Whether or not this theoretical distinction between these types of tasks is true, the search
for predictors of performance on degraded speech tasks remains important, especially
considering the demonstrations from Chapter 2 that these are stable general individual
differences.
5.2 Conclusions and future directions
Clinical Implications
The results of the analyses from Chapter 2 indicate that the performance
variability observed for NH listeners in challenging listening tasks is reliable and is
grounded in differences in some general cognitive/perceptual ability as indicated by the
correlations in performance across tasks. Watson’s (1991) proposal that these individual
differences in NH listeners may provide some explanation for the remarkable variability
in speech perception outcomes for hearing-impaired (HI) individuals remains viable. This is an
important line of work for its potential implications for the hearing-impaired population,
as there exist no good predictors of outcome performance for hearing aids or cochlear
implants. Although there were no successful predictors of individual performance on
degraded speech in this study, the work constrains further the type of measures that may
be predictive. A better understanding of the causes of individual differences would
significantly benefit the goal of predicting outcome measures for HI individuals. It may
also shed some light on questions of why listeners receive varying degrees of benefit
from rehabilitative strategies.
Strengths and limitations
A strength of the current work is that the tasks involved in measuring adaptation
to degraded speech are measures specific to the task of degraded speech perception.
Unlike many of the previous attempts to predict outcome performance, the tasks
implemented in this study include measures of adaptation to specific spectral and
temporal cues that are presumably relevant to the specific speech degradation tasks in this
study. In addition, this is the first study to look at performance variability across a
number of degradation types and across multiple sessions with a large number of subjects
(66). This allows one to examine whether this variability represents true stable individual
differences.
It is important to note several limitations of the current work. First, the test stimuli
in both the Canonical and Reversed blocks of each adaptation task were randomly
presented. While this allowed for a measure of overall adaptation across the two blocks, it
does not allow one to examine when or how quickly the adaptation occurs. Because the
nature of the current work on individual differences is specific to a more immediate
perception of degraded speech (as opposed to long-term adaptation or perceptual
learning), a measure of how quickly listeners adapt their cue-weighting strategy might be
indicative of how well they would perform on a degraded speech task.
Second, there is a potential discrepancy between listening effort and accuracy
when considering overall performance scores on a degraded speech task. In a real-world
scenario, this could apply, for instance, to two individuals with similar hearing loss. With
amplification, both could exhibit comparable intelligibility scores, but different levels of
effort to achieve that performance (Gosselin & Gagne, 2011). Because this study did not
include a measure of listening effort, it remains unclear whether this discrepancy
influenced the lack of significant correlation between degraded speech scores and
adaptation measures.
Finally, there is the question of whether examining a listener’s weighting of a
single acoustic cue at a time (as in the Vowel and Consonant Adaptation tasks) will most
accurately capture the perceptual demands of understanding speech in an adverse
listening environment. In particular, the adaptation tasks only examined the down-weighting
of a secondary cue in phonetic categorization. One may presume that in the
degraded speech tasks listeners may need to up-weight a secondary cue while
down-weighting a primary cue. For example, formant frequencies for vowel categorization
may need to be down-weighted for noise-vocoded speech, whereas the duration cues
would need to be up-weighted. Perhaps a more complex cue weighting paradigm could
provide more insight into this ability at the phonetic level (though the resulting measures
would be more complicated).
Future directions
The search for predictors of individual differences in degraded speech perception
remains important both for practical clinical impacts, such as deriving predictors of
outcomes for those with hearing loss, and for theoretical impacts in development of
cognitive models of speech perception. The failure of basic psychoacoustic measures,
adaptation of phonetic categorization measures, and working memory measures to predict
substantial variability in degraded speech performance can provide some direction about
what variables to examine next. In some sense, these three classes of measures were the
simplest and most intuitive measures, covering basic auditory encoding, basic perceptual
phonetic processing, and one of the most fundamental measures of cognitive capacity.
Moving forward, we are likely faced with investigating more complex and less intuitive
measures. Because the number of non-intuitive complex measures is very large, we
would be best served to have a better understanding of the task demands and overlapping
processes involved in these tasks. The current study used three types of degradation for a
mono-syllabic word recognition task. Despite the different listening challenges posed by
these three kinds of degradation, there was substantial overlap in performance. While
this overlap is interesting in that it points to a general ability underlying speech
perception, akin to Spearman’s (1904) g factor of general intelligence, it tells us less
about how many different processes may be involved in these tasks. For example, if
comparison of the compressed and noise vocoded tasks had demonstrated lower
correlations in performance, then one may have proposed that there were non-overlapping
processes involved in accommodating changes in normal spectral versus
temporal cues. In order to get some indication of the number and types of processes
involved in degraded speech perception, one would need to include a larger number of
listening challenges. Most previous studies have only used one type of degradation, but
one could include, in addition to the three used in the current study: sine-wave speech
(Remez, Rubin, Pisoni & Carrell, 1981), temporally-reversed speech (Saberi & Perrott,
1999; where short temporal windows of 50-150 ms are reversed), speech produced by
dysarthric individuals (Choe et al., 2009), high-pass filtered speech (Vitela et al., 2015),
reduced speech (Ernestus & Warner, 2011), accented speech, and speech in the presence
of a single competing talker. Presumably these tasks share some common processes of
adaptation but also differ in what is required to achieve high performance. Based on how
performance on these tasks clusters, one may be able to develop better ideas of what types
of measures could be predictive of performance across clusters and within each cluster.
In addition to examining a large set of degradation types, one may also wish to
examine performance across linguistic levels. One could compare performance on
phonemic categorization tasks, word recognition tasks and sentence comprehension tasks
with different types of degradation. As discussed previously, there have been several
researchers who have argued that phonetic categorization tasks are irrelevant to
perception at the word or sentence level. In addition, there may be reason to believe that
perception of sentences and words are not completely overlapping tasks. Carbonell and
Lotto (2013) presented NH listeners with noise-vocoded isolated words or the same
words embedded in the carrier phrase “The next word on the list is…” Despite the fact
that the carrier phrase provided no semantic or syntactic predictability of the target word,
listeners were substantially more accurate transcribing the word in the phrase. Carbonell
and Lotto (2013) took this result to indicate that perception of words in a sentence is a
different task from perceiving them in isolation. Examining different degradation types
across these three linguistic levels would provide tests of some of these theoretical
questions but would also provide a large enough database to allow one to start creating a
model of the processes involved in degraded speech perception.
Whereas the current study did not find evidence that cue weighting adaptability at
the phonetic level was predictive of degraded speech perception, the idea of individual
differences in perceptual flexibility still holds intuitive appeal. Perhaps it
is necessary to define flexibility at a larger level than the phoneme (i.e. word or sentence
level) or to develop more complex measures of adaptability. In the long run, such
measures may be a hope for understanding variability in hearing impairment outcomes
and may even lead to more individually tailored solutions and therapies.
APPENDIX A
Below are the 3 randomized lists of words used in the degraded speech tasks.
APPENDIX B
The table below outlines the number of days between sessions for the degraded speech
and adaptation tasks for each subject.
Ss     Days between Degraded Speech tasks     Days between Adaptation tasks
S1     10                                      9
S2     10                                      3
S3     15                                      8
S4      7                                      5
S5      8                                     13
S6      4                                      2
S7      9                                     13
S8      5                                      4
S9      2                                      1
S10     7                                     10
S11    15                                     10
S12    18                                     11
S13     7                                      1
S14    13                                      7
S15    15                                      7
S16     8                                      3
S17     7                                      7
S18    18                                     14
S19    10                                      7
S20     9                                      6
S21     8                                      6
S22     8                                      2
S23     2                                      5
S24     2                                      6
S25    12                                      5
S26    12                                      4
S27    13                                      6
S28    10                                      5
S29     5                                      5
S30     3                                      6
S31    32                                     46
S32     2                                     15
S33     6                                      1
S34     6                                      4
S35     6                                      5
S36     5                                      5
S37     7                                      6
S38     5                                      5
S39     5                                      5
S40    47                                     46
S41     7                                      5
S42     3                                      2
S43     5                                      4
S44     8                                      1
S45     2                                      1
S46    16                                      2
S47     8                                      2
S48     8                                      7
S49     7                                     12
S50    22                                      8
S51    12                                      7
S52     5                                      1
S53     7                                     25
S54    22                                      4
S55    10                                     21
S56    15                                     13
S57     2                                      1
S58     8                                      7
S59     8                                      8
S60     7                                      5
S61     3                                      1
S62    12                                      5
S63    12                                      5
S64     8                                      1
S65    14                                     14
S66     5                                      1
REFERENCES
Abramson, A. S., & Lisker, L. (1964). A cross-language study of voicing in initial stops:
Acoustical measurements.
Abramson, A. S., & Lisker, L. (1985). Relative power of cues: F0 shift versus voice
timing. Phonetic linguistics: Essays in honor of Peter Ladefoged, 25-33.
Adank, P., & Janse, E. (2010). Comprehension of a novel accent by young and older
listeners. Psychology and aging, 25(3), 736.
Adobe Systems, Inc. (2004). Adobe audition 1.5. San Jose, CA.
Akeroyd, M. A. (2008). Are individual differences in speech reception related to individual
differences in cognitive ability? A survey of twenty experimental studies with normal and
hearing-impaired adults. International Journal of Audiology, 47(Suppl. 2), S53–S71.
Azadpour, M., & Balaban, E. (2015). A proposed mechanism for rapid adaptation to
spectrally distorted speech. The Journal of the Acoustical Society of America, 138(1), 44-57.
Baddeley, A., Gathercole, S., & Papagno, C. (1998). The phonological loop as a language
learning device. Psychological review, 105(1), 158.
Banks, B., Gowen, E., Munro, K. J., & Adank, P. (2015). Cognitive predictors of
perceptual adaptation to accented speech. The Journal of the Acoustical Society of
America, 137(4).
Berg, B. G. (1989). Analysis of weights in multiple observation tasks. The Journal of the
Acoustical Society of America, 86(5), 1743-1746.
Besser, J., Festen, J. M., Goverts, S. T., Kramer, S. E., & Pichora-Fuller, M. K. (2015).
Speech-in-Speech Listening on the LiSN-S Test by Older Adults With Good Audiograms
Depends on Cognition and Hearing Acuity at High Frequencies. Ear and hearing, 36(1),
24-41.
Bilger, R. C., & Wang, M. D. (1976). Consonant confusions in patients with
sensorineural hearing loss. Journal of Speech, Language, and Hearing Research, 19(4),
718-748.
Blumstein, S. E. (1995). The neurobiology of the sound structure of language.
Boersma, P., & Weenink, D. (2005). Praat: doing phonetics by computer (Version 4.3.01)
[Computer program]. Retrieved from http://www.praat.org/
Boersma, P., & Weenink, D. (2009). Praat: doing phonetics by computer (Version 5.1.
05)[Computer program]. Retrieved May 1, 2009.
Brunswik, E. (1955). Representative design and probabilistic theory in a functional
psychology. Psychological review, 62(3), 193.
Brunswik, E. (1947). Systematic and representative design of psychological experiments.
In Proceedings of the Berkeley symposium on mathematical statistics and probability.
Carbonell, K.M., & Lotto, A.J. “Word Recognition in Isolation vs. a Carrier Phrase”
166th Meeting of the Acoustical Society of America; December 3, 2013.
Ching, T. Y., Dillon, H., & Byrne, D. (1998). Speech recognition of hearing-impaired
listeners: Predictions from audibility and the limited role of high-frequency amplification.
The Journal of the Acoustical Society of America, 103(2), 1128-1140.
Choe, Y.-K., Liss, J. M., Lotto, A. J., Azuma, T., & Mathy, P. (2009). Individual differences
in speech perception. Presentation at the American Speech-Language-Hearing Association,
New Orleans, Louisiana.
Cleary, M., Pisoni, D. B., & Geers, A. E. (2001). Some measures of verbal and spatial
working memory in eight-and nine-year-old hearing-impaired children with cochlear
implants. Ear and hearing, 22(5), 395.
Crandell, C. C. (1991). Individual differences in speech recognition ability: implications
for hearing aid selection. Ear and hearing, 12(6), 100S-108S.
Dai, H. (2000). On the relative influence of individual harmonics on pitch judgment. The
Journal of the Acoustical Society of America, 107(2), 953-959.
Dai, H., & Berg, B. G. (1992). Spectral and temporal weights in spectral‐shape
discrimination. The Journal of the Acoustical Society of America, 92(3), 1346-1355.
Daneman, M., & Merikle, P. M. (1996). Working memory and language comprehension: A
meta-analysis. Psychonomic Bulletin & Review, 3(4), 422-433.
Dirks, D. D., Bell, T. S., Rossman, R. N., & Kincaid, G. E. (1986). Articulation index
predictions of contextually dependent words. The Journal of the Acoustical Society of
America, 80(1), 82-92.
Doherty, K. A., & Lutfi, R. A. (1996). Spectral weights for overall level discrimination in
listeners with sensorineural hearing loss. The Journal of the Acoustical Society of
America, 99(2), 1053-1058.
Dubno, J. R., Dirks, D. D., & Morgan, D. E. (1984). Effects of age and mild hearing loss
on speech recognition in noise. The Journal of the Acoustical Society of America, 76(1),
87-96.
Eisner, F., McGettigan, C., Faulkner, A., Rosen, S., & Scott, S. K. (2010). Inferior frontal
gyrus activation predicts individual differences in perceptual learning of cochlear-implant
simulations. The Journal of Neuroscience, 30(21), 7179-7186.
Elsabbagh, M., Cohen, H., Cohen, M., Rosen, S., & Karmiloff‐Smith, A. (2011). Severity
of hyperacusis predicts individual differences in speech perception in Williams
syndrome. Journal of Intellectual Disability Research, 55(6), 563-571.
Engle, R. W., Carullo, J. J., & Collins, K. W. (1991). Individual differences in working
memory for comprehension and following directions. The Journal of Educational
Research, 84(5), 253-262.
Erb, J., Henry, M. J., Eisner, F., & Obleser, J. (2012). Auditory skills and brain
morphology predict individual differences in adaptation to degraded
speech.Neuropsychologia, 50(9), 2154-2164.
Ernestus, M., & Warner, N. (2011). An introduction to reduced pronunciation
variants. Journal of Phonetics, 39(3), 253-260.
Escudero, P., Benders, T., & Lipski, S. C. (2009). Native, non-native and L2 perceptual
cue weighting for Dutch vowels: The case of Dutch, German, and Spanish
listeners. Journal of Phonetics, 37(4), 452-465.
Espinoza‐Varas, B., & Watson, C. S. (1988). Low commonality between tests of auditory
discrimination and of speech perception. The Journal of the Acoustical Society of
America, 84(S1), S143-S143.
Faber, A. (1992). Phonemic segmentation as epiphenomenon. In P. Downing, S. D. Lima, &
M. Noonan (Eds.), The linguistics of literacy, 21, 111-134.
Foster, Shipstead, Harrison, Hicks, Redick, & Engle (2014). Shortened complex span
tasks can reliably measure working memory capacity. Memory and Cognition. DOI
10.3758/s13421-014-0461-7
Francis, A. L., Ciocca, V., Ma, L., & Fenn, K. (2008). Perceptual learning of Cantonese
lexical tones by tone and non-tone language speakers. Journal of Phonetics, 36(2), 268-294.
Francis, A. L., Kaganovich, N., & Driscoll-Huber, C. (2008). Cue-specific effects of
categorization training on the relative weighting of acoustic cues to consonant voicing in
English. The Journal of the Acoustical Society of America,124(2), 1234-1251.
Fu, Q. J., Shannon, R. V., & Wang, X. (1998). Effects of noise and spectral resolution on
vowel and consonant recognition: Acoustic and electric hearing. The Journal of the
Acoustical Society of America, 104(6), 3586-3596.
Füllgrabe, C., Moore, B. C., & Stone, M. A. (2015). Age-group differences in speech
identification despite matched audiometrically normal hearing: contributions from
auditory temporal processing and cognition. Frontiers in Aging Neuroscience, 6, 347.
Ghitza, O., & Greenberg, S. (2010). Intelligibility of time-compressed speech with
periodic and aperiodic insertions of silence: evidence for endogenous brain rhythms in
speech perception? In The Neurophysiological Bases of Auditory Perception (pp. 393-405). New York: Springer.
Gosselin, P. A., & Gagne, J. P. (2011). Older adults expend more listening effort than
young adults recognizing speech in noise. Journal of Speech, Language, and Hearing
Research, 54(3), 944-958.
Green, K. M. J., Bhatt, Y. M., Mawman, D. J., O'driscoll, M. P., Saeed, S. R., Ramsden,
R. T., & Green, M. W. (2007). Predictors of audiological outcome following cochlear
implantation in adults. Cochlear implants international, 8(1), 1-11.
Hazan, V., & Barrett, S. (2000). The development of phonemic categorization in children
aged 6–12. Journal of phonetics, 28(4), 377-396.
Held, R. (1965). Plasticity in sensory-motor systems. Scientific American.
Hillenbrand, J.M. & Gayvert, R.T. (2003). Open-source software for experiment design
and control, JSLHR, 48, 45-60.
Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics
of American English vowels. The Journal of the Acoustical society of America, 97(5),
3099-3111.
Lotto, A. J., & Holt, L. L. (2000). The illusion of the phoneme. Chicago Linguistic
Society.
Holt, L. L., & Lotto, A. J. (2006). Cue weighting in auditory categorization: Implications
for first and second language acquisition. The Journal of the Acoustical Society of
America, 119(5), 3059-3071.
Holt, L. L., & Lotto, A. J. (2010). Speech perception as categorization. Attention,
Perception, & Psychophysics, 72(5), 1218-1227.
Hombert, J. M. (1978). Consonant types, vowel quality, and tone. Tone: A linguistic
survey, 77, 112.
House, A. S., & Fairbanks, G. (1953). The influence of consonant environment upon the
secondary acoustical characteristics of vowels. The Journal of the Acoustical Society of
America, 25(1), 105-113.
Hughes, M. L., & Stille, L. J. (2008). Psychophysical versus physiological spatial
forward masking and the relation to speech perception in cochlear implants. Ear and
hearing, 29(3), 435.
Humes, L. E. (1996). Speech understanding in the elderly. Journal of the American Academy
of Audiology, 7, 161-167.
Idemaru, K., & Holt, L. L. (2011). Word recognition reflects dimension-based statistical
learning. Journal of Experimental Psychology: Human Perception and Performance,
37(6), 1939.
Idemaru, K., & Holt, L. L. (2014). Specificity of dimension-based statistical learning in
word recognition. Journal of Experimental Psychology: Human Perception and
Performance, 40(3), 1009.
Iverson, P., Kuhl, P. K., Akahane-Yamada, R., Diesch, E., Tohkura, Y. I., Kettermann,
A., & Siebert, C. (2003). A perceptual interference account of acquisition difficulties for
non-native phonemes. Cognition, 87(1), B47-B57.
Janse, E. (2004). Word perception in fast speech: artificially time-compressed vs.
naturally produced fast speech. Speech Communication,42(2), 155-173.
Janse, E., Nooteboom, S., & Quene, H. (2003). Word-level intelligibility of time-compressed
speech: Prosodic and segmental factors. Speech Communication, 41, 287–301.
Kalikow, D. N., Stevens, K. N., & Elliott, L. L. (1977). Development of a test of speech
intelligibility in noise using sentence materials with controlled word predictability.
Journal of the Acoustical Society of America, 61, 1337–1351.
Kingston, J., & Diehl, R. L. (1994). Phonetic knowledge. Language, 419-454.
Kohler, K. J. (1982). F0 in the production of lenis and fortis plosives. Phonetica, 39(4-5),
199-218.
Kortekaas, R., Buus, S., & Florentine, M. (2003). Perceptual weights in auditory level
discrimination. The Journal of the Acoustical Society of America, 113(6), 3306-3322.
Kryter, K. D. (1962). Methods for the calculation and use of the articulation index. The
Journal of the Acoustical Society of America, 34(11), 1689-1697.
Lehiste, I., & Peterson, G. E. (1961). Some basic considerations in the analysis of
intonation. The Journal of the Acoustical Society of America, 33(4), 419-425.
Lewis, M. S., Crandell, C. C., Valente, M., & Horn, J. E. (2004). Speech perception in
noise: Directional microphones versus frequency modulation (FM) systems. Journal of
the American Academy of Audiology, 15(6), 426-439.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967).
Perception of the speech code. Psychological review, 74(6), 431.
Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception
revised. Cognition, 21(1), 1-36.
Lim, S. J., & Holt, L. L. (2011). Learning Foreign Sounds in an Alien World: Videogame
Training Improves Non-Native Speech Categorization. Cognitive Science, 35(7), 1390-1405.
Lisker, L. (1986). “Voicing” in English: a catalogue of acoustic features
signaling/b/versus/p/in trochees. Language and speech, 29(1), 3-11
Liu, R., & Holt, L. L. (2015, August 17). Dimension-Based Statistical Learning of
Vowels. Journal of Experimental Psychology: Human Perception and Performance.
Advance online publication. http://dx.doi.org/10.1037/xhp0000092
Liu, R., Soh, H., & Holt, L. L. (2011). Cue weighting for speech categorization changes
based on regularities in short‐term speech input. The Journal of the Acoustical Society of
America, 129(4), 2659-2659.
Lotto, A. J., Sato, M., & Diehl, R. L. (2004). Mapping the task for the second language
learner: The case of Japanese acquisition of /r/ and /l/. From Sound to Sense, 50(2004), C381-C386.
Lutfi, R. A. (1989). Informational processing of complex sound. I: Intensity
discrimination. The Journal of the Acoustical Society of America, 86(3), 934-944.
Lutfi, R. A. (1995). Correlation coefficients and correlation ratios as estimates of
observer weights in multiple‐observation tasks. The Journal of the Acoustical Society of
America, 97(2), 1333-1334.
Lutfi, R. A., & Oh, E. L. (1997). Auditory discrimination of material changes in a
struck-clamped bar. The Journal of the Acoustical Society of America, 102(6), 3647-3656.
Lutfi, R. A., & Stoelinga, C. N. (2010). Sensory constraints on auditory identification of
the material and geometric properties of struck bars. The Journal of the Acoustical
Society of America, 127(1), 350-360.
Lutfi, R. A., & Wang, W. (1999). Correlational analysis of acoustic cues for the
discrimination of auditory motion. The Journal of the Acoustical Society of
America, 106(2), 919-928.
Maddox, W., & Chandrasekaran, B. (2014). Tests of a dual-system model of speech
category learning. Bilingualism: Language and cognition, 17(04), 709-728.
Mangold, S., & Leijon, A. (1979). Programmable hearing aid with multichannel compression.
Scandinavian Audiology, 8, 121-126.
Mattys, S. L., & Liss, J. M. (2008). On building models of spoken-word recognition:
When there is as much to learn from natural “oddities” as artificial normality. Perception
& Psychophysics, 70(7), 1235-1242.
Mayo, C., & Turk, A. (2004). Adult–child differences in acoustic cue weighting are
influenced by segmental context: Children are not always perceptually biased toward
transitions. The Journal of the Acoustical Society of America,115(6), 3184-3194.
McGettigan, C., Rosen, S., & Scott, S. K. (2008). Investigating the perception of noise-vocoded speech – an individual differences approach. Journal of the Acoustical Society of
America, 123(5), 3330.
Meyer J, Dentel L, Meunier F (2013) Speech Recognition in Natural Background Noise.
PLoS ONE 8(11): e79279. doi:10.1371/journal.pone.0079279
Miceli, G., Gainotti, G., Caltagirone, C., & Masullo, C. (1980). Some aspects of
phonological impairment in aphasia. Brain and language, 11(1), 159-169.
Middelweerd, M. J., Festen, J. M., & Plomp, R. (1990). Difficulties with Speech
Intelligibility in Noise in Spite of a Normal Pure-Tone Audiogram: Original Papers.
International Journal of Audiology, 29(1), 1-7.
Miyawaki, K., Jenkins, J. J., Strange, W., Liberman, A. M., Verbrugge, R., & Fujimura,
O. (1975). An effect of linguistic experience: The discrimination of [r] and [l] by native
speakers of Japanese and English. Perception & Psychophysics, 18(5), 331-340.
Monson, B. B., Hunter, E. J., Lotto, A. J., & Story, B. H. (2014). The perceptual
significance of high-frequency energy in the human voice. Frontiers in psychology, 5.
Moulines, E., & Charpentier, F. (1990). Pitch-synchronous waveform processing
techniques for text-to-speech synthesis using diphones. Speech communication, 9(5),
453-467.
Nittrouer, S. (1996). Discriminability and perceptual weighting of some acoustic cues to
speech perception by 3-year-olds. Journal of Speech, Language, and Hearing
Research, 39(2), 278-297.
Nittrouer, S. (2004). The role of temporal and dynamic signal components in the
perception of syllable-final stop voicing by children and adults. The Journal of the
Acoustical Society of America, 115(4), 1777
Nittrouer, S., & Studdert-Kennedy, M. (1987). The role of coarticulatory effects in the
perception of fricatives by children and adults. Journal of Speech, Language, and
Hearing Research, 30(3), 319-329.
Obleser, J., Eisner, F., & Kotz, S. A. (2008). Bilateral speech comprehension reflects
differential sensitivity to spectral and temporal features. The Journal of
Neuroscience, 28(32), 8116-8123.
Parbery-Clark, A., Strait, D. L., Anderson, S., Hittner, E., & Kraus, N. (2011). Musical
experience and the aging auditory system: implications for cognitive abilities and hearing
speech in noise. PLoS One, 6(5), e18082-e18082.
Peirce, J. W. (2007). PsychoPy - Psychophysics software in Python. Journal of
Neuroscience Methods, 162(1-2):8–13.
Pisoni, D. B., & Cleary, M. (2003). Measures of working memory span and verbal
rehearsal speed in deaf children after cochlear implantation. Ear and hearing, 24(1
Suppl), 106S.
Pisoni, D. D., & Geers, A. E. (2000). Working memory in deaf children with cochlear
implants: Correlations between digit span and measures of spoken language processing.
The Annals of otology, rhinology & laryngology. Supplement, 185, 92.
Port, R. (2007). How are words stored in memory? Beyond phones and phonemes. New
Ideas in Psychology, 25(2), 143-170.
Port, R. F. (2010). Rich memory and distributed phonology. Language Sciences, 32(1),
43-55.
Postman, L., & Tolman, E. C. (1959). Brunswik's probabilistic
functionalism. Psychology: A Study of a Science, 1, 502-564.
Remez, R. E., Rubin, P. E., Pisoni, D. B., & Carrell, T. D. (1981). Speech perception
without traditional speech cues. Science, 212(4497), 947-949.
Rodriguez, G. P., DiSarno, N. J., and Hardiman, C. J. (1990). ‘‘Central auditory
processing in normal-hearing elderly adults,’’ Audiology 29, 85–92.
Ronnberg, J., Arlinger, S., Lyxell, B., & Kinnefors, C. (1989). Visual evoked potentials:
Relation to adult speechreading and cognitive function. Journal of Speech,
Language, and Hearing Research, 32(4), 725-735.
Rupp, R. R., & Phillips, D. (1969). Effect of Noise Background on Speech
Discrimination Function in Normal-Hearing Individuals. Journal of Auditory Research,
9(1), 60-63.
Saberi, K., & Perrott, D. R. (1999). Cognitive restoration of reversed
speech. Nature, 398(6730), 760.
Schertz, J. (2014). The Structure and Plasticity of Phonetic Categories Across Languages
and Modalities (Unpublished Doctoral Dissertation). University of Arizona, Tucson,
Arizona.
Schertz, J., Cho, T., Lotto, A., Warner, N. (in press) Individual differences in perceptual
adaptability to foreign sound categories. Attention, Perception & Psychophysics.
Schertz, J., Cho, T., Lotto, A., & Warner, N. (2015). Individual differences in phonetic
cue use in production and perception of a non-native sound contrast.Journal of
Phonetics, 52, 183-204.
Scott, S. K., Blank, C. C., Rosen, S., & Wise, R. J. (2000). Identification of a pathway for
intelligible speech in the left temporal lobe. Brain, 123(12), 2400-2406.
Schneider, W., Eschman, A., & Zuccolotto, A. (2002). E-Prime: User's guide.
Psychology Software Incorporated.
Schoof, T., & Rosen, S. (2014). The role of auditory and cognitive factors in
understanding speech in noise by normal-hearing older listeners. Frontiers in aging
neuroscience, 6.
Shannon, R. V., Galvin, J. J., III, & Baskent, D. (2002). Holes in hearing. Journal of the
Association for Research in Otolaryngology, 3, 185–199.
Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech
recognition with primarily temporal cues. Science, 270, 303–304.
Smoorenburg, G. F., de Laat, J. A., & Plomp, R. (1982). The effect of noise-induced
hearing loss on the intelligibility of speech in noise. Scandinavian Audiology.
Spahr, A. J., Dorman, M. F., & Loiselle, L. H. (2007). Performance of patients using
different cochlear implant systems: effects of input dynamic range. Ear and hearing,
28(2), 260-275.
Stach, B., Speerschneider, J., & Jerger, J. (1987). Evaluating the efficacy of automatic signal
processing hearing aids. Hearing Journal, 40(3), 15-19.
Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of
experimental psychology, 18(6), 643.
Surprenant, A. M., & Watson, C. S. (2001). Individual differences in the processing of
speech and nonspeech sounds by normal-hearing listeners. The Journal of the Acoustical
Society of America, 110(4), 2085-2095.
Unsworth, N., Heitz, R. P., Schrock, J. C., & Engle, R. W. (2005). An automated version
of the operation span task. Behavior Research Methods, 37, 498-505.
Van Tasell, D., Larsen, S., & Fabry, D. (1988). Effects of an adaptive filter hearing aid on
speech recognition in noise by hearing-impaired subjects. Ear and Hearing, 9, 15-21.
Vaughan, N. E., & Letowski, T. (1997). Effects of age, speech rate, and type of test on
temporal auditory processing. Journal of Speech, Language, and Hearing Research,
40(5), 1192-1200.
Venables, W. N., Smith, D. M., & the R Development Core Team (2010). An Introduction to R.
Bristol: Network Theory Limited.
Vitela, A. D., Monson, B. B., & Lotto, A. J. (2015). Phoneme categorization relying
solely on high-frequency energy. The Journal of the Acoustical Society of
America, 137(1), EL65-EL70.
Walley, A. C., & Carrell, T. D. (1983). Onset spectra and formant transitions in the
adult’s and child’s perception of place of articulation in stop consonants.The Journal of
the Acoustical Society of America, 73(3), 1011-1022.
Warner, N., Fountain, A., & Tucker, B. V. (2009). Cues to perception of reduced
flaps. The Journal of the Acoustical Society of America, 125(5), 3317-3327.
Warner, N., & Tucker, B. V. (2011). Phonetic variability of stops and flaps in
spontaneous and careful speech. The Journal of the Acoustical Society of
America, 130(3), 1606-1617.
Watson, B. U. (1991). Some relationships between intelligence and auditory
discrimination. Journal of Speech, Language, and Hearing Research, 34(3), 621-627.
Watson, C.S., Johnson, D.M., Lehman, J.R., Kelly, W.J., & Jensen, J.K. (1982). An
auditory discrimination test battery. J Acoust Soc Am 71:S73.
Wechsler, D. (1999). Wechsler abbreviated scale of intelligence. Psychological
Corporation.
Wilson, R. H. (2003). Development of a speech in multitalker babble paradigm to assess
word-recognition performance. Journal of the American Academy of Audiology, 14, 453–470.
Wilson, R. H., Carnell, C. S., & Cleghorn, A. L. (2007). The Words-in-Noise (WIN) test
with multitalker babble and speech-spectrum noise maskers. Journal of the American
Academy of Audiology, 18(6), 522-529.
Wingfield, A. (1996). Cognitive factors in auditory performance: Context, speed of
processing, and constraints of memory. Journal of the American Academy of Audiology, 7, 175-182.
Yoshioka, P., & Thornton, A. R. (1980). Predicting speech discrimination from the
audiometric thresholds. Journal of Speech, Language, and Hearing Research, 23(4), 814-827.