Why Are Commercials so Loud? - Perception and Modeling of the Loudness of Amplitude-Compressed Speech ...................................... Brian C. J. Moore, Brian R. Glasberg, and Michael A. Stone
According to urban legend, commercials are broadcast with higher loudness levels than programming. An
empirical study confirmed that four-band compressed speech sounds louder than uncompressed speech by
as much as 3 dB when the rms levels are matched. An audio engineer can control the perceived
loudness of broadcast program material only if a loudness meter is available for monitoring the program.
However, loudness models require significant computational power if used in real time.
Smart Digital Loudspeaker Arrays ...........................................................................M. O. J. Hawksford
With the advent of microminiature transducers, a new class of loudspeaker design fundamentals is
required in order to implement programmable radiation beam directions and beamwidths. The primary
objective of this study was to develop a processing strategy to obtain a target directional radiation from an
array of transducers, each with its own dedicated signal processing. Coherent and diffuse beams can be
obtained simultaneously from the same array over a wide frequency range.
Localization of 3-D Sound Presented through Headphones - Duration of Sound Presentation
and Localization Accuracy ....................................................................................................Fang Chen
Of all the spatial parameters that influence localization accuracy, signal duration is often one of the most
important. When the duration is long enough, approaching four seconds, accuracy using headphones is
comparable to that offree-field or individual HRTFs. The results of this empirical study are consistent
with a wide variety of sound samples. Designers of auditory displays must include signal duration as an
important parameter.
Reconstruction of Mechanically Recorded Sound by Image Processing
...............................................................................Vitaliy Fadeyev and Carl Haber
Two-dimensional image processing offers a modern method to reproduce historic mechanical recordings
without using a contact transducer. Moreover, because image processing uses information spread over a
wide area, it is easier to remove noise, scratches, and other defects. In addition to avoiding additional
degradation by contact transducers, optical decoding of groove undulations produces better audio quality.
A contact transducer senses mechanical position at a single point in space and time, whereas an image
incorporates mechanical information spanning a large area. This approach may allow automated preservation of
endangered audio performances of historic value.
Comments on "Analysis of Traditional and Reverberation-Reducing Methods of Room
Equalization" .........................................................................................................John N. Mourjopoulos
Author's Reply .................................................................................................................Louis D. Fielder
Correction to: "Effects of Down-Mix Algorithms on Quality of Surround Sound"
......................................................................................................S. K. Zielinski, F. Rumsey, and S. Bech
AES Standards Committee News...........................................................................................................
Digital audio synchronization; listening tests; Internet audio quality
115th Convention Report, New York.......................................................................................................
Exhibitors ..................................................................................................................................................
11th Tokyo Regional Convention Report ...............................................................................................
Exhibitors ..................................................................................................................................................
Education News ........................................................................................................................................
Call for Nominations for Board of Governors .......................................................................................
Call for Awards Nominations ..................................................................................................................
Bylaws: Audio Engineering Society, Inc. ...............................................................................................
Index to Volume 51 ..................................................................................................................................
News of the Sections ...................................... 1271
Sound Track ..................................................... 1279
Upcoming Meetings ........................................ 1279
New Products and Developments .................. 1280
Available Literature ......................................... 1281
Membership Information ................................. 1285
Advertiser Internet Directory .......................... 1286
In Memoriam .................................................... 1287
AES Special Publications ............................... 1317
Sections Contacts Directory .......................... 1322
AES Conventions and Conferences .............. 1328

Why Are Commercials so Loud? - Perception
and Modeling of the Loudness of
Amplitude-Compressed Speech*

Brian C. J. Moore, Brian R. Glasberg, and Michael A. Stone

Department of Experimental Psychology, University of Cambridge, Cambridge CB2 3EB, England
The level of broadcast sound is usually limited to prevent overmodulation of the transmitted signal. To increase the loudness of broadcast sounds, especially commercials, fast-acting amplitude compression is often applied. This allows the root-mean-square (rms) level
of the sounds to be increased without exceeding the maximum permissible peak level. In
addition, even for a fixed rms level, compression may have an effect on loudness. To assess
whether this was the case, we obtained loudness matches between uncompressed speech
(short phrases) and speech that was subjected to varying degrees of four-band compression.
All rms levels were calculated off line. We found that the compressed speech had a lower rms
level than the uncompressed speech (by up to 3 dB) at the point of equal loudness, which
implies that, at equal rms level, compressed speech sounds louder than uncompressed speech.
The effect increased as the rms level was increased from 50 to 65 to 80 dB SPL. For the
largest amount of compression used here, the compression would allow about a 58% increase
in loudness for a fixed peak level (equivalent to a change in level of about 6 dB). With a slight
modification, the model of loudness described by Glasberg and Moore [1] was able to
account accurately for the results.
It is a common complaint of the general public that
commercials on television and radio are louder than the
normal program material. This is the case despite the
guidelines that are intended to prevent it. For example, the
UK's Independent Television Commission (ITC) has a
code of conduct,1 which states that "advertisements must
not be excessively noisy or strident. Studio transmission
power must not be increased from normal levels during
advertising breaks."
Broadcasters are not allowed to overmodulate the
broadcast radio-frequency signal, and this limits the peak
level that can be transmitted. Hence it is common in the
broadcasting industry to use a peak-level meter to monitor the level of signals that are to be broadcast. However,
the subjective loudness of sounds is not determined solely
by the peak level of those sounds [2]-[7]. Amplitude
compression is one technique that is used by producers of
commercials to manipulate sounds so as to increase their
*Manuscript received 2003 June 19; revised 2003 September 17.
1 See www.itc.org.uk.
J. Audio Eng. Soc., Vol. 51, No. 12, 2003 December
loudness while leaving the peak level unchanged. Fast-acting compression reduces the peak level of sounds relative to their rms level, allowing the rms level to be increased while keeping the peak level the same. This in itself leads to an increase in loudness. The ITC's code of conduct states: "To ensure that subjective volume is consistent with adjacent programming, whilst also preventing excessive loudness changes, highly compressed commercials should be limited to a Normal Peak of 4 and a Full Range of 2-4 (measured on a PPM Type IIa, specified in BS6840: Part 10, Programme Level Meters). A fairly constant average level of sound energy should be maintained in transitions from programmes to advertising
breaks and vice versa so that listeners do not need to
adjust the volume."
In this context, there is an additional factor that appears
to be largely unexplored. Even for a fixed rms level, compressed sounds with fluctuating envelopes, such as
speech, may differ in loudness from uncompressed
sounds. The exact way in which compression might affect
loudness for complex sounds such as speech is not clear.
Consider a speech signal that is subjected to fast-acting
compression, with the rms level of the compressed speech
adjusted to match that of the original speech. In the compressed speech, the peaks are reduced in level relative to
the original, and the dips are increased in level relative to
the original. The envelope of the compressed signal
therefore fluctuates less over time. The auditory system
itself incorporates a fast-acting compression system in
the cochlea (the inner ear); for midrange sound levels
(30-90 dB SPL) this produces about 3:1 compression
[8]-[12]. The compression is produced by an "active
mechanism," which applies a gain that decreases progressively with increasing input level. As a result, input
signals with a high peak factor (ratio of peak to rms
value), such as the original speech in our example, have
a lower effective level at the output of the physiological
compressor than input signals with a low peak factor,
such as the compressed speech [13]-[15]. This might
lead to a greater loudness for the compressed speech
[13]. On the other hand, signals with high peak factors
do not always sound quieter than signals with lower peak
factors but with the same rms level; Gockel et al. [14], [16]
found that harmonic complex tones with components
added in cosine phase (all starting with a phase of 90°)
sounded louder than complex tones with the same power
spectrum but with components added in random phase,
even though the former have a higher peak factor than
the latter.
As the effect of fast-acting compression on the loudness
of speech was difficult to predict, we decided to conduct
perceptual experiments to determine the effect. The results
are compared to the predictions of the loudness model for
time-varying sounds described by Glasberg and Moore
[1], which was developed from the model for the loudness
of stationary sounds described by Moore et al. [17]. It has
been suggested previously that some form of loudness
meter should be used to control the loudness of broadcast
sounds. For example, the ITC code of conduct says: "A
perceived loudness meter may be useful where sound levels might cause problems," and Emmett [18] also advocated the use of "loudness-based perceptual criteria."
However, to our knowledge the ability of loudness meters
or models to predict the loudness of compressed speech
has not been evaluated previously.
1.1 Speech Stimuli
One set of stimuli was based on a male talker and one
on a female talker. The male-talker stimuli were taken
from track 5 of the CD "Music for Archimedes" produced
by Bang & Olufsen (B&O 101) [19]. The female-talker
stimuli were taken from track 49 of the CD "Sound
Quality Assessment Material" (SQAM) produced by the
European Broadcasting Union.2
2 www.ebu.ch; materials also available from http://sound.
The running speech analog stimuli derived from the CD were sampled via a high-quality sound card (Turtle Beach Montego II) and stored on computer disc with 16-bit resolution at a 32-kHz sampling rate. Pauses longer than 140 ms were removed by
hand editing, and the speech was then divided into 2.1-s
segments, eight for the female talker and ten for the male
talker. The stimuli were chosen to be sufficiently long to
give a good impression of overall loudness.
1.2 Method of Compression
The overlap-add method [20] was used to apply the
compression. The data were separated into frames of
length 32 samples (1 ms), each frame overlapping 50%
with its neighbors. The data in each frame were processed,
and then the frames were recombined. Within each frame,
the processing took the following form:
1) The data were "windowed" in the time domain,
using a window designed so as to produce less spectral
smearing than the more conventionally used raised-sine
window. The window used was derived from a Kaiser window with α = 11.6 [21]. This Kaiser window was integrated to give one side of the final desired window; the
other side was its mirror image. The resultant window has
the desirable property that its amplitude-squared value,
when summed with that of neighboring (overlapping)
windows, gives a value of unity at all points. The α value
of 11.6 was chosen because it leads to spectral side lobes
whose level decreases rapidly with increasing frequency
separation from the main lobe, while keeping the width of
the main lobe at a reasonable value.
2) The data were padded with 16 samples of value zero
on either side.
3) A Fourier transform was applied to convert the data
to the frequency domain.
4) The compression (if any) was applied; see below for details.
5) An inverse Fourier transform was used to convert
back to the time domain.
6) The same window as in 1) was applied again.
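The window construction in step 1) can be sketched as follows. The text does not specify the discretization of the integration or the final normalization, so the midpoint summation and the square root below are assumptions, chosen so that the squared window, summed with its 50%-overlapped neighbor, is exactly unity:

```python
import numpy as np

FRAME = 32        # samples per frame (1 ms at 32 kHz)
HOP = FRAME // 2  # 50% overlap between neighboring frames
BETA = 11.6       # Kaiser parameter (the alpha value of the text)

def integrated_kaiser_window(frame=FRAME, beta=BETA):
    """One side of the window is derived by integrating a Kaiser window;
    the other side is its mirror image. Midpoint integration and the
    square root are illustrative assumptions."""
    k = np.kaiser(frame // 2, beta)        # symmetric Kaiser kernel
    c = (np.cumsum(k) - k / 2) / k.sum()   # midpoint-integrated, runs 0..1
    rise = np.sqrt(c)                      # sqrt so the *squared* window overlaps to 1
    return np.concatenate([rise, rise[::-1]])

w = integrated_kaiser_window()
# amplitude-squared values of overlapping windows sum to unity at all points
assert np.allclose(w[:HOP] ** 2 + w[HOP:] ** 2, 1.0)
```

Because the window is applied twice (steps 1 and 6), the overlap-add reconstruction weight at every sample is the squared window value, which is why the unity property is stated for the amplitude-squared window.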
A four-band compression system was implemented.
The compressor gain control signals were derived from
unweighted sums of the powers in bins 1-4, 5-10,
11-18, and 19-32, for bands 1, 2, 3, and 4, respectively.
The crossover frequencies between adjacent bands fell at
1.5, 4.5, and 8.5 kHz.
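A minimal sketch of the per-band control-signal computation just described, summing unweighted bin powers over the stated ranges (bin indices are 1-based as in the text; the spectrum is assumed to expose at least bins 1-32):

```python
import numpy as np

# 1-based, inclusive FFT bin ranges feeding the four gain control signals
BAND_BINS = [(1, 4), (5, 10), (11, 18), (19, 32)]

def band_powers(spectrum):
    """Unweighted sum of bin powers in each band for one frame.
    `spectrum` holds the complex FFT bins of that frame."""
    p = np.abs(np.asarray(spectrum)) ** 2
    return [float(p[lo:hi + 1].sum()) for lo, hi in BAND_BINS]
```

For a frame whose bins all have unit magnitude, the band powers are simply the bin counts 4, 6, 8, and 14.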
The attack component of the compressor gain control
signal in each band was based on the summed power value
for each frame so that, during an attack (that is, an
increase in level), the gain control signal was updated
every 0.5 ms, without any smoothing except that resulting
from the overlap-add procedure. The release time was initially defined as the time τ required for the gain to settle to
within 2 dB of its steady value following an abrupt
decrease in level of 25 dB. This is based on an ANSI
measurement standard used in the hearing aid industry
[22]. However, in previous work [23] it was pointed out
that this definition has a drawback when comparing compressors with differing compression ratios. Since the
measure is based on the output of the compressor, for
fixed time constants used to calculate the gain control signal, the value of τ will vary depending on the compression
ratio used. In the present study these "internal" time constants were kept fixed when the compression ratio was varied, so the value of τ, defined according to the ANSI
standard, varied with the compression ratio. For ease of
description, our release time constants were defined in
terms of the value expected for a compression limiter, that
is, a compressor with infinite compression ratio. The
release times defined in this way were 60, 50, 40, and 30
ms for bands 1, 2, 3, and 4, respectively.
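The control-signal behavior described above (unsmoothed attack, smoothed release) can be sketched as a simple per-frame envelope follower. The one-pole decay coefficient below is an illustrative assumption; as the text notes, the quoted release times are defined via an ANSI settling-time criterion, not directly as a time constant:

```python
import numpy as np

def gain_control_envelope(frame_powers, release_ms, frame_ms=0.5):
    """Per-band gain control envelope: during an attack the control signal
    jumps to the new frame power without smoothing; during a release it
    decays via a one-pole smoother. The coefficient mapping is an
    illustrative assumption, not the ANSI settling-time definition."""
    alpha = np.exp(-frame_ms / release_ms)   # assumed per-frame decay factor
    env, prev = [], 0.0
    for p in frame_powers:
        prev = p if p >= prev else alpha * prev + (1.0 - alpha) * p
        env.append(prev)
    return env
```

With this rule an abrupt rise in band power is tracked within one 0.5-ms frame, while an abrupt drop decays over several frames, as the per-band release times intend.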
At the onset of a change in signal level, the change in
the gain control signal can lag the change in audio level,
leading to "overshoot" or "undershoot" at the output of the
compressor. Since we wished to reduce peak levels relative to rms levels, it was necessary to avoid overshoot
effects associated with abrupt increases in sound level. To
achieve this, the audio signal was delayed by 0.5 ms relative to the gain control signal. This, combined with the
inherent smoothing produced by the overlap-add method,
meant that, in response to an abrupt increase in level, the
output signal level actually started to decrease 1 ms before
the increase in level occurred. Although visible on test
waveforms, in practice this effect was inaudible.
The compression ratios and thresholds were chosen to
ensure a substantial reduction in the peak-to-rms ratio of
the speech signals. However, with large compression
ratios and low compression thresholds, the background
noise of the original recording may become audible,
thereby degrading the audio quality. A compromise was
achieved by setting the compression threshold for each
band 3 dB below the long-term rms level within that band.
This was possible because we could read in the whole signal file to be processed and precalculate the rms value in
each band. This would not be possible in a real-time system, but would be possible in a postproduction suite.
The compression ratios used were chosen on the basis
of the fractional reduction in modulation fr, which was
introduced by Stone and Moore [24]. It is assumed that a
sinusoidally amplitude-modulated signal is applied as the
input to a compressor, and the modulation depth (peak-to-valley
ratio in dB) at the output is measured [25], [26].
The measure fr then indicates the relative amount of modulation
removed by the compressor. It is defined as

fr = (change in modulation depth)/(original modulation depth) = (CRe − 1)/CRe

where CRe is the effective compression ratio, which varies
with the modulation frequency. When the modulator has a
period that is much longer than the attack and release time
constants, fr reaches a limiting value determined by the static
compression ratio of the compressor. With no compression,
fr = 0. For CRe = 2, the value of fr is 0.5. For CRe = 10,
the value of fr is 0.9, implying that there is little temporal
variation left in the envelope of the output.
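The relation between fr and the effective compression ratio can be checked directly:

```python
def fractional_reduction(cr_effective):
    """Fractional reduction in modulation: fr = (CRe - 1) / CRe."""
    return (cr_effective - 1.0) / cr_effective

# the limiting values quoted in the text
assert fractional_reduction(1.0) == 0.0              # no compression
assert fractional_reduction(2.0) == 0.5
assert abs(fractional_reduction(10.0) - 0.9) < 1e-12
```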
To determine appropriate values of fr to use in the
experiment, sequences of speech and music varying in
length between 18 and 28 s were processed by the compression system described, using a wide range of compression ratios. Histograms were formed of the level distributions of the wide-band signals at the output, using
rectangular time windows varying from 0.1 to 31 ms in
duration. The peak value of each histogram was then calculated in decibels relative to the rms value. There was a
near linear relationship between this ratio and the value of
fr for very low modulation rates. This relationship was
maintained independent of the time window employed to
form the histograms, over the range specified.
Consequently, to achieve a uniform spread of values in the
peak-to-rms ratio, the limiting values of fr were chosen to
be 0, 0.3, 0.6, and 0.9. This corresponds to compression
ratios of 1, 1.43, 2.5, and 10, respectively. The same compression ratio was used in each band. The peak-to-rms
ratios for the broad-band speech were 17.6, 16.0, 14.0, and
12.6 dB for fr = 0, 0.3, 0.6, and 0.9, respectively. These
values were calculated on a sample-by-sample basis.
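The sample-by-sample peak-to-rms ratio quoted above can be computed as:

```python
import numpy as np

def peak_to_rms_db(x):
    """Peak-to-rms ratio in dB, evaluated sample by sample over the signal."""
    x = np.asarray(x, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    return 20.0 * np.log10(np.max(np.abs(x)) / rms)
```

As a sanity check, a full-scale sine wave has a peak-to-rms ratio of about 3.01 dB, well below the 12.6-17.6 dB reported for the broad-band speech.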
The gain control signals were converted to gains
according to the compression ratio in use. The four gains
(corresponding to the center frequencies of the four bands)
for each frame were interpolated across frequency to produce values for the 32 bins of the Fourier transform.
Interpolation was on a linear gain versus linear frequency
scale. Gain values for frequencies below the center frequency of band 1 or above the center frequency of band 4
were set to the gain values at the respective center frequencies. These gains were then applied to the complex
magnitude values in the Fourier transform bins, before
taking the inverse Fourier transform (step 5) above. The
overall gain following each compression band was
adjusted so that the long-term average spectra of the input
and output signals were essentially identical.
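The gain interpolation step can be sketched with `np.interp`, which interpolates on a linear-gain versus linear-frequency scale and extrapolates flat beyond the outer points, matching the rule that bins outside the outer band centers take the edge gains. The center-bin positions below are illustrative, not taken from the paper:

```python
import numpy as np

def spread_gains(band_gains, center_bins, n_bins=32):
    """Interpolate four per-band linear gains across the FFT bins
    (linear gain vs. linear frequency); bins below the first or above
    the last band center take the edge gain, as np.interp extrapolates
    flat. The center-bin positions are an assumption for illustration."""
    return np.interp(np.arange(1, n_bins + 1), center_bins, band_gains)

gains = spread_gains([1.0, 2.0, 4.0, 8.0], [2, 8, 14, 26])
```

The resulting 32 gain values would then multiply the complex FFT bins before the inverse transform in step 5).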
1.3 Equipment
The stimuli were replayed via a Tucker-Davis
Technologies (TDT) 16-bit digital-to-analog converter
(DD1), and stimulus levels were controlled by a TDT PA4
programmable attenuator, which was under computer control. The output of the PA4 was fed to a headphone amplifier (TDT HB6) and then via a Hatfield 2125 manual
attenuator to both earpieces of a pair of Sennheiser HD580
headphones. Subjects were seated in a double-walled
sound-attenuating chamber.
1.4 Procedure
A loudness matching procedure was used to determine
the rms levels at which the compressed and uncompressed
speech sounded equally loud. A given compressed segment was always matched with the corresponding uncompressed segment (that is, with the same sequence of
words). The segment used was chosen randomly for each
run (each loudness match), except that all conditions with
a given gender of the speaker (male or female) were completed before testing with the other gender. Within a given
run, either the compressed or the uncompressed speech
was fixed in level, and the subject varied the level of the
other sound to determine the level corresponding to equal
loudness. The compressed and uncompressed speech were
presented in alternation, with a 500-ms interval between
the fixed stimulus and the variable stimulus. Following the
variable stimulus, there was an 800-ms interval during
which a light was turned on. During this interval, the subject could alter the level of the variable stimulus via buttons on the response box.
The level of the fixed stimulus was either 50, 65, or 80
dB SPL. The starting level of the variable stimulus was
chosen randomly for each run within a range of ± 10 dB
around the level of the fixed stimulus. Subjects were told
to press the left button if the second (variable) stimulus
in the presentation cycle appeared louder than the first
(fixed) stimulus. This resulted in a decrease in level of the
variable stimulus. Subjects were told to press the right button if the variable stimulus appeared softer. This resulted
in an increase in level of the variable stimulus. If no button was pressed during the 800-ms interval, the level of
the variable stimulus stayed the same. A change from
pressing the left button to pressing the right button, or vice
versa, was termed a turnaround. The step size for the
change in level was 3 dB until two turnarounds had
occurred and was 1 dB thereafter. Subjects were instructed
to "bracket" the point of equal loudness several times by
making the variable stimulus clearly louder than the fixed
stimulus and then clearly softer, before using the buttons
to make the stimuli equal in loudness. When subjects were
satisfied with a loudness match, they indicated this by
pressing a third button, and the level of the variable stimulus at this point was taken as the matching level. The
computer did not accept this button press until four turnarounds had occurred. To reduce bias effects, for each
condition three runs were obtained with the compressed
stimulus varied and three runs were obtained with the
uncompressed stimulus varied.
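The adjustment rule can be illustrated with a small simulation of an idealized subject; the simulated listener's decision rule (press left whenever the variable level is above the true match point) is, of course, an assumption made for the sketch:

```python
def run_matching(true_match_db, start_db):
    """Sketch of the adaptive rule: the step is 3 dB until two turnarounds
    have occurred and 1 dB thereafter; a match is accepted only after at
    least four turnarounds. The idealized listener lowers the variable
    level when it exceeds the true equal-loudness point, else raises it."""
    level, step, turnarounds, last_dir = start_db, 3.0, 0, 0
    for _ in range(200):
        direction = -1 if level > true_match_db else 1
        if last_dir and direction != last_dir:
            turnarounds += 1
            if turnarounds >= 2:
                step = 1.0
        if turnarounds >= 4 and abs(level - true_match_db) <= 1.0:
            return level
        level += direction * step
        last_dir = direction
    return level
```

With the final 1-dB step, such a track settles within 1 dB of the equal-loudness point, consistent with the bracketing instruction given to the subjects.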
The stimuli were characterized by four independent variables:
1) Uncompressed or compressed using one of three
compression ratios (1.43, 2.5, and 10.0)
2) The sound level of the fixed stimulus (50, 65, or 80 dB SPL)
3) The gender of the speaker (male or female)
4) Whether the compressed or the uncompressed stimulus was varied in level.
To reduce effects related to experience, fatigue, and so
on, the order of testing these variables was counterbalanced across subjects.
1.5 Subjects
The six subjects were all university students with no
reported hearing disorders. There were three females and
three males, all aged between 19 and 22 years. Each subject was initially trained for about 15 min on a selection of
conditions, to ensure that they were familiar with the task.
All gave stable results after this training period. The
experiment proper was conducted over three sessions for
each subject, each lasting about one hour. The interval
between testing sessions was approximately 2 days.
2 RESULTS

To assess the effect of the different variables, the results
for each condition were expressed as the difference in rms
level between the uncompressed stimulus and the compressed
stimulus at the point of equal loudness, that is, as (level of
uncompressed stimulus) − (level of compressed stimulus). If
this difference is positive, it implies that, at equal rms level,
the compressed stimulus would sound louder than the
uncompressed stimulus. If the difference is negative, it
implies the opposite. In fact, in the mean data the difference
was always positive, implying that compression leads to an
increase in loudness for a fixed rms level.

Initially a within-subjects analysis of variance (ANOVA)
was conducted with the following factors: level of the fixed
stimulus, amount of compression of the compressed stimulus,
gender of the talker, and type of fixed stimulus (uncompressed
or compressed). The ANOVA showed no significant effect
of the gender of the talker. Therefore this factor is ignored
in the rest of this paper. The ANOVA showed a significant
effect of whether the fixed stimulus was compressed or
uncompressed; F(1, 5) = 22.28, p < 0.01. The mean difference
in level at the point of equal loudness was 2.1 dB when the
uncompressed stimulus was fixed in level and 1.3 dB when
the compressed stimulus was fixed in level. This reflects a
bias effect, which has been observed previously in
loudness-matching experiments [6], [14], [16]. This bias
effect is not of importance for this study, and we assumed
that a reasonably unbiased estimate of the difference in level
at the point of equal loudness could be obtained by averaging
results across the two types of condition (compressed
stimulus fixed and uncompressed stimulus fixed). In what
follows, all results are averaged in this way as well as across
the gender of the talker.

Fig. 1 shows the difference in level at the point of equal
loudness plotted as a function of the compression ratio
(top axis) and fractional reduction in modulation fr (bottom
axis), with the level of the fixed stimulus as the parameter.
(The dashed lines are predictions of the loudness model,
which will be described later.) The results indicate that the
difference in level at the point of equal loudness increases
with increasing compression ratio. This was confirmed by
the ANOVA, which showed a significant main effect of the
compression ratio: F(2, 10) = 59.5, p < 0.001. The results
also show that the effect of compression is greater at high
levels; the difference in level at the point of equal loudness
increased with increasing level. This was confirmed by the
ANOVA; F(2, 10) = 8.81, p < 0.01. There was also a
significant interaction of level and compression ratio;
F(4, 20) = 2.92, p = 0.045. This is consistent with the
observation that the three curves in Fig. 1 are not parallel.
The effect of level was somewhat greater for the highest
compression ratio than for the two lower compression ratios.
For the highest level and the highest compression ratio, the
mean effect of compression on the level difference was 3 dB.
For the lowest level and lowest compression ratio, the mean
effect was only 0.6 dB.
impression: the listener can judge the short-term loudness,
for example, the loudness of a specific syllable; or the listener can judge the overall loudness of a relatively long
segment, such as a sentence. We will refer to the latter as
the long-term loudness. The model computes both shortterm and long-term loudness. In our experiment, the subjects compared the overall loudness of relatively long
sample of speech, including several words. We assume,
therefore, that the long-term loudness was being judged.
We give first a brief description of the model.
3.1 Outline of the Model
The stages of the model are as follows.
1) A finite impulse response filter representing the
transfer from the sound field through the outer and middle
ear to the cochlea. For the present analysis, the transfer
response of the outer ear was chosen to represent that
obtained in a diffuse sound field [27], [28], as the
Sennheiser HD580 earphones are designed to mimic such
a response.
2) Calculation of the short-term spectrum using the fast
Fourier transform (FFT). To give adequate spectral resolution at low frequencies, combined with adequate temporal
resolution at high frequencies, six FFTs are calculated in
parallel, using longer signal segments for low frequencies
and shorter segments for higher frequencies.
3) Calculation of an excitation pattern from the physical spectrum [17], [29], [30].
4) Transformation of the excitation pattern to a specific
loudness pattern [17].
5) Determination of the area under the specific loudness pattern. This gives a value for the "instantaneous"
loudness, which is updated every 1 ms.
We assume that the instantaneous loudness is an intervening variable which is not available for conscious per-
en 3.5
> 2.5
To assess whether the loudness model of Glasberg and
Moore [1] could account for the results, the stimuli used
in the experiment were applied as input to the model. For
sounds like speech, there are two aspects to the loudness
J. Audio Eng. Soc., Vol. 51, No. 12,2003 December
where aa is a constant.
If Sn < S' n- 1 (corresponding to a release, as the instantaneous loudness is less than the short-term loudness),
where ar is a constant.
The values of aa and ar were set to 0.045 and 0.02,
respectively. (These values are appropriate when the
instantaneous loudness is updated every 1 ms.) The value
of aa was chosen to give reasonable predictions for the
variation.of loudness with duration [31]. The value of ar
was chosen to give reasonable predictions of the overall
loudness of amplitude-modulated sounds [4], [6], [7]. The
Compression ratio
b. 80 dB
D 65 dB
0 50 dB
--- ---
.....c: 1.0
ception. The perception of loudness depends on the summation or integration of neural activity over times longer
than 1 ms.
The short-term perceived loudness is calculated using a
form of temporal integration or averaging of the instantaneous loudness which resembles the way that a control
signal is generated in a dynamic-range compression circuit. This was implemented in the following way. We
define S' n as the short-term loudness at the time corresponding to the nth time frame (updated every 1 ms), sn as
the instantaneous loudness at the nth time frame, and
S'n- 1 as the short-term loudness at the time corresponding
If Sn > S' n- 1 (corresponding to an attack, as the instantaneous loudness at frame n is greater than the short-term
loudness at the previous frame), then
Fractional reduction in modulation
Fig. 1. Difference in level of uncompressed and compressed speech at the point of equal loudness, plotted as a function of compression ratio (top axis) and fractional reduction in modulation (bottom axis). Positive numbers mean that uncompressed speech had a
higher rms level than compressed speech at the point of equal loudness. Parameter is the overall level of the fixed stimulus. Data are
collapsed across the gender of the talker and across conditions where the fixed stimulus was uncompressed or compressed. Error bars
show ± 1 standard error across subjects.--- Predictions of Glasberg and Moore loudness model [1].
J. Audio Eng. Soc., Vol. 51, No. 12, 2003 December
fact that aa is greater than ar means that the short-term
loudness can increase relatively quickly when a sound is
turned on, but it takes somewhat longer to decay when the
sound is turned off.
The long-term loudness is calculated from the short-term loudness, again using a form of temporal integration resembling the operation of a compression circuit. Denote the long-term loudness at the time corresponding to frame n as S''n. If S'n > S''n-1 (corresponding to an attack, as the short-term loudness at frame n is greater than the long-term loudness at the previous frame), then

S''n = αa1 S'n + (1 - αa1) S''n-1

where αa1 is a constant.

If S'n < S''n-1 (corresponding to a release, as the short-term loudness is less than the long-term loudness), then

S''n = αr1 S'n + (1 - αr1) S''n-1

where αr1 is a constant.

The values of αa1 and αr1 were set to 0.01 and 0.0005, respectively. The value of αa1 was chosen partly to give correct predictions of the loudness of amplitude-modulated sounds as a function of the modulation rate [6]. However, the value of αr1 was not well defined by the data available at the time, and its value was chosen somewhat arbitrarily.
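The two smoothing stages can be sketched as a pair of asymmetric one-pole filters, as in the control-signal path of a dynamic-range compressor. The sketch below uses the αa1 = 0.01 and αr1 = 0.005 values given in the text for the long-term stage; the short-term coefficients shown are illustrative placeholders, since their values are not stated here.

```python
def smooth(values, alpha_attack, alpha_release, start=0.0):
    """Asymmetric one-pole smoother: the attack coefficient applies when
    the input exceeds the running value, the release coefficient otherwise."""
    out, prev = [], start
    for x in values:
        a = alpha_attack if x > prev else alpha_release
        prev = a * x + (1.0 - a) * prev
        out.append(prev)
    return out

def long_term_loudness(instantaneous,
                       a_att_short=0.045, a_rel_short=0.02,  # illustrative only
                       a_att_long=0.01, a_rel_long=0.005):   # values from the text
    """Instantaneous -> short-term -> long-term loudness (one frame per ms)."""
    short_term = smooth(instantaneous, a_att_short, a_rel_short)
    return smooth(short_term, a_att_long, a_rel_long)
```

Because the attack coefficient exceeds the release coefficient at each stage, a burst raises the smoothed loudness quickly but lets it decay slowly after the offset.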
3.2 Predicting the Data
To predict the data, we mimicked what was done in the
experiment. For each condition (for example, for a 65-dB
SPL uncompressed fixed stimulus, and a variable-level
compressed stimulus with a compression ratio of 2.5),
each of the 18 2.1-s segments of the fixed-level stimulus
was used as input to the model. The long-term loudness
was calculated. We found that, despite the relatively slow
averaging process used to calculate the long-term loudness, the long-term loudness still fluctuated somewhat
both within a segment (even after the first second) and
across segments. We therefore obtained an overall loudness estimate for the fixed signal by averaging the long-term loudness of each segment over all times for which its value exceeded the value corresponding to absolute threshold, and then averaging across segments. All averaging was done in sones. The resulting overall loudness estimate is called Loverall.
The next stage was to pick a 2.1-s segment of the
variable-level stimulus and use that as input to the model.
The level of this segment was varied in 1-dB steps to find
two levels, separated by 1 dB, for which the average long-term loudness predicted by the model for that segment bracketed the value Loverall. The level of the variable stimulus leading to a long-term loudness equal to Loverall was then estimated by interpolation. This procedure was repeated for each segment in turn, and the levels were averaged across all 18 segments to give an estimate of the
level of the variable stimulus required for equal loudness.
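The bracketing-and-interpolation step can be sketched as follows; `loudness_at_level` is an assumed stand-in for running the loudness model on a segment presented at a given level.

```python
def equal_loudness_level(loudness_at_level, target, lo=0.0, hi=120.0):
    """Step the level in 1-dB increments until the predicted loudness
    brackets `target`, then estimate the equal-loudness level by linear
    interpolation within the 1-dB bracket (as in the procedure above)."""
    level, prev = lo, loudness_at_level(lo)
    while level < hi:
        nxt = loudness_at_level(level + 1.0)
        if prev <= target <= nxt:
            frac = (target - prev) / (nxt - prev)
            return level + frac
        level, prev = level + 1.0, nxt
    raise ValueError("target loudness not bracketed within the level range")
```

With a toy model in which loudness doubles every 10 dB, the routine recovers the generating level to within the interpolation error.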
Although the model as described earlier gave reasonable fits to the data, we found that the fits could be improved by changing the value of the constant αr1 from 0.0005 to 0.005. Recall that the value of this constant was chosen somewhat arbitrarily, as its value was not well defined by existing data. This change did not markedly alter the predictions of the model for the data analyzed in our earlier paper [1]. Hence in what follows, we use the revised value of 0.005.
The dashed lines in Fig. 1 show the predictions of the
model. It can be seen that the model predicts the data very
well; it predicts both the effect of increasing the compression ratio and the effect of level. Fig. 2 compares the
experimentally measured differences in level at the point
of equal loudness with the predictions of the model. It is
clear that there is a very good correspondence between the
two. The largest deviation between the data and the predictions is 0.6 dB, and this occurs for a condition (the
highest compression ratio and the lowest level) where the
experimental data look somewhat out of line (see Fig. 1).
4 DISCUSSION
There are three major outcomes of our study. First,
multiband compression of speech leads to an increase in
loudness for a fixed rms level, and the size of the effect
increases progressively with increasing compression ratio.
Second, the effect of compression on the level difference
required for equal loudness of compressed and uncompressed speech increases with increasing overall level.
Third, a slightly modified version of the loudness model
described by Glasberg and Moore [1] can account for the
results very well. We consider next the interpretation of
these results and their practical implications.
4.1 Origin of the Effect of Compression on Loudness
It is of interest to consider why, for a fixed rms level,
compressed speech has a greater long-term loudness than
uncompressed speech. To gain some insight into this, we
considered the output of the various stages of the loudness
model. Fig. 3 (a) shows the waveform of a 900-ms segment of uncompressed speech, which was embedded within a longer segment. Fig. 3 (b) shows that same segment following 10:1 compression. In the compressed
speech, the major peaks are reduced in amplitude relative
to those for the uncompressed speech (such as around 110
and 740 ms), while the low-amplitude portions are
increased (such as around 30 and 500 ms). The amplitude
of the compressed speech is above that of the uncompressed speech more often than the reverse.
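The effect can be illustrated numerically. The sketch below applies a static single-band 10:1 dB-domain compression (a deliberate simplification of the four-band fast-acting system used in the paper) to a toy signal with loud and soft stretches, matches the rms levels, and counts how often the compressed amplitude exceeds the uncompressed one.

```python
import math

def compress_db(x, ratio):
    """Static dB-domain compression about a reference of 1.0: the level of
    each sample (in dB re 1.0) is divided by `ratio`. An illustrative
    stand-in for the four-band fast-acting compressor used in the paper."""
    out = []
    for s in x:
        db = 20.0 * math.log10(abs(s) or 1e-12)
        out.append(math.copysign(10.0 ** (db / ratio / 20.0), s))
    return out

def rms(x):
    return math.sqrt(sum(s * s for s in x) / len(x))

# toy "speech": alternating loud and soft 200-sample stretches
sig = [(0.9 if (i // 200) % 2 == 0 else 0.05) * math.sin(0.3 * i)
       for i in range(1000)]
comp = compress_db(sig, 10.0)
gain = rms(sig) / rms(comp)          # match the rms levels
comp = [gain * s for s in comp]

peaks_reduced = max(abs(s) for s in comp) < max(abs(s) for s in sig)
frac_louder = sum(abs(c) > abs(s) for c, s in zip(comp, sig)) / len(sig)
```

On this toy signal the compressed version has lower peaks yet exceeds the uncompressed amplitude well over half the time, mirroring the relationship between Fig. 3 (a) and (b).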
Fig. 3 (c) shows the instantaneous loudness of the
uncompressed speech (solid line) and the compressed
speech (dashed line). As would be expected, at times when
the amplitude of the uncompressed speech exceeds that of
the compressed speech (such as around 110 and 740 ms),
the uncompressed speech has a higher instantaneous loudness. Conversely, at times when the amplitude of the
uncompressed speech is below that of the compressed
speech (such as around 30 and 500 ms), the uncompressed
speech has a lower instantaneous loudness. The compressed speech has a greater instantaneous loudness than the uncompressed speech for a relatively high proportion of the time.

Fig. 2. Difference in level of uncompressed and compressed speech at the point of equal loudness, plotted against predictions of the Glasberg and Moore loudness model [1]. Error bars show ±1 standard error across subjects. The dashed line shows where points would lie if the model were perfectly accurate and there were no errors in the data.

Fig. 3. (a) Waveform of a segment of uncompressed speech. (b) Same segment after 10:1 four-band compression. (c)-(e) Instantaneous, short-term, and long-term loudness of uncompressed (solid lines) and compressed (dashed lines) speech.
Fig. 3 (d) shows the short-term loudness of the uncompressed speech (solid line) and the compressed speech
(dashed line). As a result of the averaging used to produce
the short-term loudness (see Section 3.1), the short-term
loudness is mostly higher for the compressed than for the
uncompressed speech. The only exception in the example
shown is around 800 ms, where the uncompressed signal
has a relatively long high-amplitude portion.
Fig. 3 (e) shows the long-term loudness of the uncompressed speech (solid line) and the compressed speech
(dashed line). Note that the long-term loudness does not
start at a very low value, because the sample of speech was
embedded within a longer segment. The long-term loudness is nearly always higher for the compressed speech,
except around 850-900 ms, where the long-term loudness
is almost the same for the uncompressed and the compressed speech.
We can conclude that the greater long-term loudness of
the compressed speech arises from a combination of
effects: 1) The amplitude of the compressed speech, and
hence its instantaneous loudness, is above that of the
uncompressed speech for much of the time. 2) The averaging used to compute the short-term and long-term loudness results in a greater long-term loudness for the compressed speech nearly all of the time.
4.2 The Effect of Level
The difference in level between uncompressed and
compressed stimuli at the point of equal loudness
increased with increasing level. However, according to the
model, the effect of compression on loudness per se (in
sones) was almost independent of level. For example, for
the highest compression ratio, the loudness of the compressed stimuli was, on average, 1.23, 1.22, and 1.26 times
that of the uncompressed stimuli, for overall levels of 50,
65, and 80 dB SPL, respectively. These two findings can
be reconciled by taking into account the fact that, when
loudness in sones is plotted on a logarithmic scale as a
function of sound level in dB SPL, the resulting function
has a greater slope at low levels than at high levels. Hence
a change in loudness by a fixed factor corresponds to a
greater change in level at high levels than at low levels.
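This reconciliation can be checked against the numbers reported in Section 4.3: a loudness factor of about 1.23 corresponds to roughly 1.5 dB at an overall level of 50 dB SPL but to about 3 dB at 80 dB SPL, implying a local slope (in log-sones per dB) that is twice as steep at the lower level.

```python
import math

def db_change_for_factor(factor, slope):
    """dB change needed to multiply loudness by `factor`, given the local
    slope of log10(loudness) versus level, in log10-sones per dB."""
    return math.log10(factor) / slope

# implied local slopes from the level differences reported in Section 4.3
slope_low = math.log10(1.23) / 1.5    # at ~50 dB SPL: steeper
slope_high = math.log10(1.23) / 3.0   # at ~80 dB SPL: shallower
```

The same loudness factor thus costs twice as many decibels at the high level as at the low level, which is exactly the pattern in Fig. 1.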
4.3 Implications for the Use of Compression in Broadcasting
Multiband compression of speech makes the speech
appear louder than uncompressed speech of the same rms
level. In addition, compression allows the rms level of
speech to be increased without increasing the peak level.
These two effects combine to give a marked increase in
loudness while keeping within the prescribed limits of
peak level. For the highest amount of compression used here, with a compression ratio of 10:1, the rms level can be increased by 5 dB while keeping the peak level the same as for the uncompressed speech, and the compression itself leads to an effect on loudness equivalent to an increase in level of 1.5 dB (at an overall level of 50 dB SPL) to 3 dB (at an overall level of 80 dB SPL). According
to our loudness model, uncompressed speech with a level
of 80 dB SPL would lead to a long-term loudness of 23.3
sones (averaged over 2.1-s segments), while highly compressed speech with a level of 85 dB SPL would lead to a
long-term loudness of 36.9 sones, a 58% increase in loudness. This is large enough to be easily noticed and annoying. In a nonideal listening environment, where background sounds were present, the partial loudness of the
program material would be judged. The partial loudness
would be expected to change by an even greater factor, as
partial loudness changes more rapidly with sound level
than unmasked loudness [17], [32].
4.4 The Need for a Loudness Meter
Our results make it very clear that signals with similar
peak levels can differ markedly in loudness. This means
that meters based on the measurement of peak levels cannot give an accurate indication of loudness. This is already
recognized in the broadcasting industry. For example, the
ITC guidelines include a table of recommended peak values for different types of program materials, as measured
using a PPM Type IIa meter as specified in BS 6840: Part
10. For uncompressed speech materials, such as talk
shows, news, and drama, the recommended peak values
are 5 for normal peaks and 1-6 for full range. For programs or commercials containing a high degree of compression, the recommended values are 4 for normal peaks
and 2-4 for full range. However, these guidelines are necessarily approximate, and they do not take into account the
type or amount of compression applied (for example, multiband versus single-band, fast versus slow compression, or low versus high compression ratio).
It seems clear that a meter giving an accurate indication
of loudness as perceived by human listeners would be far
preferable to a meter based on the measurement of peak
levels, combined with approximate correction factors for
different types of program material. It remains to be seen
whether our loudness model gives accurate estimates of
loudness for other types of material that are used in commercials (such as music and car sounds). However, the
results presented here for speech stimuli support the use of
our model as a candidate for a loudness meter.
It should be noted that our loudness model has not yet
been implemented in a real-time loudness meter. All loudness calculations are performed on stored waveform files
in non-real time. We have previously described a real-time
loudness meter [33] based on the loudness model
described by Moore and Glasberg [34]. While that meter
gives reasonably accurate estimates of the way that loudness changes with duration, it is less accurate in predicting
the loudness of amplitude-modulated sounds than the non-real-time model evaluated in this paper [1]. The more
recent model [1] requires considerably more computational power than the model described in [33] in order to
be implemented in real time.
5 CONCLUSIONS
1) Fast-acting compression of speech leads to an increase in loudness for a fixed rms level, and the size of the effect increases progressively with increasing compression ratio.
2) The effect of compression on the level difference
required for equal loudness of compressed and uncompressed speech increases with increasing overall level. For
the highest level and highest compression ratio used, the
difference was 3 dB.
3) A slightly modified version of the loudness model
described by Glasberg and Moore [1] can account for the
results very well.
4) The use of compression can allow an increase in loudness of about 58% while maintaining the same peak level.
5) The use of peak level meters in broadcasting does
not allow adequate control of the loudness of commercials. A loudness meter should be used for this purpose.
6 ACKNOWLEDGMENT
This work was supported by the Medical Research
Council (UK). We thank John McDonald and Dan Barry
for gathering the data reported here. We also thank two anonymous reviewers for helpful comments.
7 REFERENCES
[1] B. R. Glasberg and B. C. J. Moore, "A Model of
Loudness Applicable to Time-Varying Sounds," J. Audio
Eng. Soc., vol. 50, pp. 331-342 (2002 May).
[2] M. M. Boone, "Loudness Measurements on Pure
Tone and Broad Band Impulsive Sounds," Acustica, vol.
29, pp. 198-204 (1973).
[3] R. P. Hellman, "Perceived Magnitude of Two-Tone-Noise Complexes: Loudness, Annoyance, and Noisiness," J. Acoust. Soc. Am., vol. 77, pp. 1497-1504 (1985).
[4] C. Zhang and F.-G. Zeng, "Loudness of Dynamic Stimuli in Acoustic and Electric Hearing," J. Acoust. Soc. Am., vol. 102, pp. 2925-2934 (1997).
[5] B. C. J. Moore, S. Launer, D. Vickers, and T. Baer,
"Loudness of Modulated Sounds as a Function of
Modulation Rate, Modulation Depth, Modulation
Waveform and Overall Level," in Psychophysical and
Physiological Advances in Hearing, A. R. Palmer, A.
Rees, A. Q. Summerfield, and R. Meddis, Eds. (Whurr,
London, 1998), pp. 465-471.
[6] B. C. J. Moore, D. A. Vickers, T. Baer, and S.
Launer, "Factors Affecting the Loudness of Modulated
Sounds," J. Acoust. Soc. Am., vol. 105, pp. 2757-2772 (1999).
[7] G. Grimm, V. Hohmann, and J. L. Verhey, "Loudness of Fluctuating Sounds," Acustica-Acta Acustica, vol. 88, pp. 359-368 (2002).
[8] P. M. Sellick, R. Patuzzi, and B. M. Johnstone,
"Measurement of Basilar Membrane Motion in the
Guinea Pig Using the Mössbauer Technique," J. Acoust.
Soc. Am., vol. 72, pp. 131-141 (1982).
[9] L. Robles, M.A. Ruggero, and N. C. Rich, "Basilar
Membrane Mechanics at the Base of the Chinchilla
Cochlea. I. Input-Output Functions, Tuning Curves, and
Response Phases," J. Acoust. Soc. Am., vol. 80, pp.
1364-1374 (1986).
[10] L. Robles and M. A. Ruggero, "Mechanics of the
Mammalian Cochlea," Physiol. Rev., vol. 81, pp.
1305-1352 (2001).
[11] A. J. Oxenham and C. J. Plack, "A Behavioral
Measure of Basilar-Membrane Nonlinearity in Listeners
with Normal and Impaired Hearing," J. Acoust. Soc. Am.,
vol. 101, pp. 3666-3675 (1997).
[12] B. C. J. Moore, An Introduction to the Psychology
of Hearing, 5th ed. (Academic Press, San Diego, 2003).
[13] R. P. Carlyon and A. J. Datta, "Excitation
Produced by Schroeder-Phase Complexes: Evidence for
Fast-Acting Compression in the Auditory System," J.
Acoust. Soc. Am., vol. 101, pp. 3636-3647 (1997).
[14] H. Gockel, B. C. J. Moore, and R. D. Patterson,
"Louder Sounds Can Produce Less Forward Masking:
Effects of Component Phase in Complex Tones," J.
Acoust. Soc. Am., vol. 114, pp. 978-990 (2003).
[15] B. C. J. Moore, T. H. Stainsby and E. Tarasewicz,
"Effects of Masker Component Phase on the Forward
Masking Produced by Complex Tones in Normally
Hearing and Hearing-Impaired Subjects," J. Acoust. Soc.
Am., (submitted, 2003).
[16] H. Gockel, B. C. J. Moore, and R. D. Patterson,
"Influence of Component Phase on the Loudness of
Complex Tones," Acustica-Acta Acustica, vol. 88, pp.
369-377 (2002).
[17] B. C. J. Moore, B. R. Glasberg, and T. Baer, "A
Model for the Prediction of Thresholds, Loudness, and
Partial Loudness," J. Audio Eng. Soc., vol. 45, pp.
224-240 (1997 Apr.).
[18] J. Emmett, "Audio Levels in the New World of
Digital Systems," EBU Tech. Rev., vol. 293, pp. 1-5
[19] V. Hansen and G. Munch, "Making Recordings
for Simulation Tests in the Archimedes Project," J. Audio
Eng. Soc. (Engineering Reports), Vol. 39, pp. 768-774
(1991 Oct.).
[20] J. B. Allen, "Short Term Spectral Analysis, Synthesis, and Modification by Discrete Fourier Transform," IEEE Trans. Acoust., Speech, Signal Process., vol. 25, pp. 235-238 (1977).
[21] L. D. Fielder, M. Bosi, G. Davidson, M. Davis, C.
Todd, and S. Vernon, "AC-2 and AC-3: Low-Complexity
Transform-Based Audio Coding," in Collected Papers on
Digital Audio Bit-Rate Reduction, N. Gilchrist and C.
Grewin, Eds. (Audio Engineering Society, New York,
1996), pp. 54-72.
[22] ANSI S3.22-1996, "Specification of Hearing Aid
Characteristics," American National Standards Institute,
New York (1996).
[23] M.A. Stone, B. C. J. Moore, J. I. Alcantara, and
B. R. Glasberg, "Comparison of Different Forms of
Compression Using Wearable Digital Hearing Aids," J.
Acoust. Soc. Am., vol. 106, pp. 3603-3619 (1999).
[24] M.A. Stone and B. C. J. Moore, "Effect of the
Speed of a Single-Channel Dynamic Range Compressor
on Intelligibility in a Competing Speech Task," J. Acoust.
Soc. Am., vol. 114, pp. 1023-1034 (2003).
[25] M. A. Stone and B. C. J. Moore, "Syllabic
Compression: Effective Compression Ratios for Signals
Modulated at Different Rates," Brit. J. Audiol., vol. 26, pp.
351-361 (1992).
[26] B. C. J. Moore, M.A. Stone, and J. I. Alcantara,
"Comparison of the Electroacoustic Characteristics of
Five Hearing Aids," Brit. J. Audiol., vol. 35, pp. 307-325 (2001).
[27] G. Kuhn, "The Pressure Transformation from a
Diffuse Field to the External Ear and to the Body and
Head Surface," J. Acoust. Soc. Am., vol. 65, pp. 991-1000 (1979).
[28] M. C. Killion, E. H. Berger, and R. A. Nuss,
"Diffuse Field Response of the Ear," J. Acoust. Soc. Am.,
vol. 81, p. S75 (1987).
[29] B. C. J. Moore and B. R. Glasberg, "Suggested
Formulae for Calculating Auditory-Filter Bandwidths and
Excitation Patterns," J. Acoust. Soc. Am., vol. 74, pp.
750-753 (1983).
[30] B. R. Glasberg and B. C. J. Moore, "Derivation of
Auditory Filter Shapes from Notched-Noise Data," Hear.
Res., vol. 47, pp. 103-138 (1990).
[31] B. Scharf, "Loudness," in Handbook of
Perception, vol. IV. Hearing, E. C. Carterette and M. P.
Friedman, Eds. (Academic Press, New York, 1978), pp.
[32] E. Zwicker, "Über psychologische und methodische Grundlagen der Lautheit" [On the Psychological and Methodological Foundations of Loudness], Acustica, vol. 8, pp. 237-258 (1958).
[33] M. A. Stone, B. C. J. Moore, and B. R. Glasberg,
"A Real-Time DSP-Based Loudness Meter," in
Contributions to Psychological Acoustics, A. Schick and
M. Klatte, Eds. (Bibliotheks- und Informationssystem der Universität Oldenburg, Oldenburg, Germany, 1997), pp.
[34] B. C. J. Moore and B. R. Glasberg, "A Revision of
Zwicker's Loudness Model," Acustica-Acta Acustica,
vol. 82, pp. 335-345 (1996).
Smart Digital Loudspeaker Arrays*
M. O. J. HAWKSFORD, AES Fellow
Centre for Audio Research and Engineering, University of Essex, Colchester, C04 3SQ, UK
A theory of smart loudspeaker arrays is described where a modified Fourier technique
yields complex filter coefficients to determine the broad-band radiation characteristics of a
uniform array of micro drive units. Beamwidth and direction are individually programmable
over a 180° arc, where multiple agile and steerable beams carrying dissimilar signals can be
accommodated. A novel method of stochastic filter design is also presented, which endows
the directional array with diffuse radiation properties.
B. R. Glasberg
M.A. Stone
Brian R. Glasberg received B.Sc. and Ph.D. degrees in
applied chemistry from Salford University in 1968 and
1972, respectively.
He then worked as a chemist refining precious metals.
He spent some time as a researcher of process control
before joining the laboratory of Brian C. J. Moore at the
University of Cambridge, UK, initially as a research associate and then as a senior research associate.
Dr. Glasberg's research focuses on the perception of
sound in both normally hearing and hearing-impaired
people. He also works on the development and evaluation
of hearing aids, especially digital hearing aids. He is a
member of the Acoustical Society of America and has
published 95 research papers and book chapters.
Michael Stone received a B.A. degree in engineering science from Cambridge University (UK) in 1982. He received a Ph.D. degree in 1995 for his thesis on "Spectral Enhancement for the Hearing Impaired."

He joined the BBC Engineering Research Department, where he worked on an early digital audio editor, video bit-rate reduction schemes, and high-definition television (HDTV) scanning methods for both the camera and the display. The HDTV work involved subjective assessment of picture quality, from which he became interested in the human interface to technology. In 1988 he joined Brian Moore's Psychoacoustics Group. There he looked at signal processing strategies for both analog and digital hearing aids. Some of his recent work has been on the subjective and objective effects on speech production and perception of the processing delays in digital hearing aids. Another part of his work has been characterizing the behavior of dynamic-range compressors and investigating their effects on speech intelligibility. This work guided the selection of the parameters of the compression system used in this paper.
This paper considers, from a theoretical stance, the fundamental requirements of a programmable-polar-response digital loudspeaker array, or smart digital loudspeaker array (SDLA), which consists of either a one-dimensional or a two-dimensional array of micro radiating elements.
The principal problem addressed here is the design of a set of digital filters which, together with a uniform array of small drive units, achieves a well-defined directional beam
that can both be steered over a 180° arc and be specified
in terms of beamwidth such that it remains constant with
the steering angle. Intrinsic and critical to the SDLA is the
requirement that the beam parameters remain stable over
a broad frequency range. In addition to addressing the
problem of coherent radiation, the theory is extended to
include the synthesis of directionally controllable diffuse
radiation that is similar although not identical to the class of sound field produced by a distributed-mode loudspeaker (DML) [1], [2].

* Presented at the 110th Convention of the Audio Engineering Society, Amsterdam, The Netherlands, 2001 May 12-15; revised 2003 September 29. This study was undertaken for NXT Transducers plc, UK.

DML behavior can be emulated using an array of discrete radiating elements with excitation signals calculated to model panel surface wave propagation and boundary reflections using techniques such as finite element vibration analysis [3]. However, for the SDLA a different approach is taken where each element drive signal is derived by convolution of the input signal with an element-specific but stochastically independent temporally diffuse impulse response (TDI) [4]. Each TDI is calculated to have a constant-magnitude response but a unique random-phase response, where for loudspeaker applications it is formed asymmetrically to have a rapid initial response and a decaying "tail" exhibiting a noise-like character.
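The TDI idea can be sketched as a generic construction, not the specific design of [4]: a unit-magnitude, random-phase spectrum is inverse-transformed to a real impulse response, to which a fast-onset decaying envelope (here a simple exponential, an assumption for illustration) can then be applied.

```python
import cmath
import math
import random

def tdi_impulse(n, seed=0):
    """Real impulse response with a constant-magnitude, random-phase
    spectrum, built from a Hermitian-symmetric spectrum and an inverse
    DFT (O(n^2), adequate for a sketch)."""
    rng = random.Random(seed)
    spec = [complex(1.0)] * n
    for k in range(1, n // 2):
        phi = rng.uniform(0.0, 2.0 * math.pi)
        spec[k] = cmath.exp(1j * phi)
        spec[n - k] = cmath.exp(-1j * phi)   # Hermitian symmetry -> real output
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def shape(h, decay=8.0):
    """Apply a fast-onset, exponentially decaying envelope (illustrative)."""
    n = len(h)
    return [h[t] * math.exp(-decay * t / n) for t in range(n)]
```

Each array element would receive its own independently seeded TDI; convolving the program signal with these filters yields mutually incoherent element drives.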
The conceptual structure of an SDLA is shown in Fig. 1, where each microdriver within the array is addressed directly by a digital signal that has been filtered adaptively to allow the polar response to be specified and controlled
dynamically. It is proposed to configure the transducer
array with a large number of nominally identical acoustic
radiating elements, where the overall array size and
interelement spacing (referred to here as interspacing)
The biography of Brian Moore was published in the
2003 November issue of the Journal.
Fig. 1. Conceptual model of smart digital loudspeaker array (SDLA).