Chapter 1: Introduction to digital audio

Applications: audio players (e.g. MP3), DVD-audio, digital
audio broadcast, music synthesizer, digital amplifier and
equalizer, 3D sound synthesis
Properties of Sound
Sound is what we call the air pressure variations generated by
moving objects. It is a continuous acoustic wave that travels
through the air.
Sound waves are longitudinal waves, which means the particle
movement caused by the wave is parallel to the direction of
propagation. They propagate in three dimensions with
compressible gas as medium. Sound waves have normal wave
properties (reflection, refraction, diffraction, etc.)
Sound intensity is directly related to the acoustic power per unit
area. We must consider both the pressure and velocity aspects of
sound waves in order to explain the interaction of a sound wave
with a microphone or an ear. The relationship between pressure
change and volume change is described by the bulk modulus B of
the gas:

B = −ΔP / (ΔV/V)

where ΔP is the change in pressure and ΔV/V is the fractional
change in volume. The bulk modulus and density determine the
velocity of sound:

v = √(B/ρ)

where ρ is density. The speed of sound is about 330 m/s at room
temperature and sea-level atmospheric pressure.
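As a numeric sketch, the speed of sound follows from the bulk modulus and density; the values below are typical textbook figures for air, assumed here rather than taken from the text:

```python
import math

# Illustrative values (assumed, not from the text):
B = 1.42e5    # Pa, adiabatic bulk modulus of air near 20 degrees C
rho = 1.2     # kg/m^3, density of air

v = math.sqrt(B / rho)   # speed of sound
print(round(v))          # 344 (m/s, close to the quoted ~330-340 m/s)
```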
The sound waves propagate outward as an expanding sphere of
concentric pressure variations. The pressure variation decreases in
amplitude as it occupies a larger and larger volume of air.
When we hear sounds, the perceived loudness is determined by
the amount of power transmitted from the source to our ears.
The intensity of a sound is the power per unit area received at
the sensing device. The intensity received from a point source
decreases with the square of the distance from the source.
We commonly use sound pressure level (SPL) as the measure
of sound amplitude. Unlike intensity, which falls off with the
square of distance, sound pressure falls off in inverse proportion
to distance from the source. This results in a 6 dB decrease in
SPL for each doubling of distance.
Humans have an audible SPL range from 0 dB to about 140 dB
within the frequency range from 50 Hz to 20 kHz.
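The inverse-distance rule for sound pressure can be sketched directly (a free-field point source is assumed here):

```python
import math

def spl_drop_db(d1, d2):
    """SPL decrease when moving from distance d1 to d2 from a point
    source in a free field, where pressure falls as 1/r."""
    return 20 * math.log10(d2 / d1)

print(round(spl_drop_db(1, 2), 2))   # 6.02 (dB per doubling of distance)
```

Each doubling of distance adds another ~6 dB of loss, so four times the distance costs about 12 dB.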
Waves interact with objects based on the dimensions of the object
relative to the wavelength of the sound. Wavelengths much longer
than an object will smoothly bend around edges and spread out in
space. Wavelengths shorter than the dimensions of an object will
cast shadows, much as light does: they behave more as beams. In between the
extremes, waves may bend partially as they encounter a surface
edge. This is known as diffraction.
Reflection occurs when a wave strikes a fixed object and bounces
back. Sound waves reflect at the same angle at which they
encounter an object. Since each collision absorbs some of the
sound energy, repeated reflections will ultimately cause the sound
to decay, and since each collision removes energy in a frequency-dependent way, the timbre of the sound changes as well as the amplitude.
The shape and dimensions of a space largely determine how
sounds are altered as they radiate and decay inside that space.
The dimensions are important because the distance a sound must
traverse on its path from source to sensor will shape the
cancellations and reinforcements that occur as a function of the
wavelengths present. The reinforcement of certain frequencies
may occur through resonance. The resonances of a room will color
the sound of reverberation as certain frequencies reinforce and
others do not.
Sound Transduction
Conversion of vibrations in an instrument or loudspeaker
into vibrations of air molecules
Conversion of air vibrations into mechanical vibrations, then
ultimately into electrical oscillations in a microphone or
electrochemical signals in the nervous system
Transduction of sound pressure waves into electrical signals is
performed by microphones. There are several types of microphone:
dynamic (moving coil and ribbon) and capacitor (also known as
condenser) are the main types. Each type has particular strengths
and weaknesses.
Choosing a microphone is much like selecting an instrument:
there is no single ideal microphone for a given situation,
but some types sound better than others for particular
applications.
Dynamic Microphone
For a moving-coil dynamic microphone, the signal is created
when a coil of wire attached to a diaphragm moves in and
out, through a magnetic field, as the air pressure changes.
An electrical signal is created by induction as the wires in
the coil cut through the magnetic field.
The mass of the moving parts must be very low, so that it
does not require much energy to move the diaphragm. The
mass of the assembly limits the high-frequency response
of the dynamic microphone. Dynamic microphones tend to
be quite sturdy and of low cost, so they are commonly
used to record drums, amplifier outputs, human voices,
and other sources which produce high sound pressure levels.
Capacitor Microphone
Commonly called condenser microphones, these microphones
function by making the diaphragm one plate of a capacitor. As
the diaphragm vibrates, it changes the capacitance of the
capacitor in proportion to the sound pressure level. The
capacitance is converted to a voltage by a special amplifier inside
the microphone. Since this requires outside power, capacitor
microphones require a battery or external power supply (known
as “phantom” power because it is transmitted back over the same
cable as the output signal by a special technique).
The capacitor itself must be charged, so some electricity must be
used to polarize the capacitor.
Because the mass of the diaphragm can be much smaller than that of a
dynamic diaphragm, the capacitor microphone is usually better
suited to high-frequency sound transduction. Condenser
microphones have very high impedances.
Electret microphone – replaces the phantom power supply with a
permanently charged material, e.g. Teflon, so there is no need
for an external power supply.
Microphone polar patterns
Each microphone exhibits a pattern of directional sensitivity:
that is, it is more sensitive to sounds arriving from certain
directions. The sensitivity of the microphone to sounds from a
particular direction is indicated by the radius from the center
of the plot to the perimeter at the angle in question.
Sensitivity is measured in terms of voltage output for a given
sound pressure level input.
In addition to the sensitivity, the frequency response of the
microphone also varies as the angle of incidence of the sound
changes. The so-called off-axis response affects the
microphone sound as it “colors” the sounds coming from the
sides and rear of the microphone.
Selection of microphone
The main selection criterion is the microphone’s sensitivity in
volts per unit of sound pressure,
e.g. −75 dB re: 1 V/µbar,
where 1 pascal = 10 µbar = 94 dB SPL
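A sensitivity figure quoted this way can be converted to an output voltage; the sketch below uses the −75 dB figure from the text and assumes it is referenced to 1 V per microbar:

```python
def sens_to_v_per_pa(db_re_1v_per_ubar):
    """Convert a sensitivity quoted in dB re 1 V/microbar to V/Pa.
    Since 1 Pa = 10 microbar, the per-pascal output is 10x
    higher, i.e. +20 dB."""
    return 10 ** ((db_re_1v_per_ubar + 20) / 20)

v_per_pa = sens_to_v_per_pa(-75)
# 94 dB SPL corresponds to 1 Pa RMS, so this mic would
# output roughly 1.8 mV at 94 dB SPL.
```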
Noise inherent in the microphone, i.e., “self-noise” or
equivalent input noise. Noise in dynamic microphones is
generated by random thermal processes that are impedance
related, hence the low-impedance of most microphones. Noise
in capacitor microphones is mostly generated in the internal
electronic circuitry.
Spatial sensitivity (i.e., polar pattern)
Maximum sound pressure that can be transduced. Overloads
in dynamic mics are generated by the physical limits of the
diaphragm, which is mechanically damped, allowing very high
sound pressure levels (up to 140 dB SPL) to pass undistorted.
Overloads in capacitor mics usually occur at the limits of the
electronics power supply voltage rather than due to physical
excursion limits in the sensing capsule. Even capacitor mics
can usually handle SPLs up to 130 dB SPL, with many capable
of transducing SPLs up to 160 dB SPL with the use of internal
pads (attenuators).
Loudspeakers convert electrical energy into sound pressure
Basic construction: a cone of paper or paper-like material is
suspended within a frame, and connected to the
loudspeaker's input. A permanent magnet surrounds the
center of the cone, and as alternating current is applied to
the loudspeaker's coil, the changing magnetic field
causes the cone to move away from or towards
the permanent magnet.
Most loudspeaker cabinets and their components share similar design
characteristics. Because of the large range of frequencies that are
needed to properly reproduce the audible range of sound, in most
cases, more than one component is used.
A loudspeaker component, called a driver, that reproduces the lower
frequencies of sound is called a woofer, while a component that is
dedicated to reproducing higher frequencies is called a tweeter.
A midrange driver is the name given to a component that is designed
to reproduce the middle frequencies. Since low-frequency sounds have
a longer wavelength, low-frequency drivers tend to be larger.
Loudspeaker Enclosure
A primary function of a speaker enclosure is to keep the sound coming
from the back of a driver cone from going into the room. All drivers
should have their own individual enclosures.
Loudspeaker Placement
Coverage angle, intended coverage, system designs. Sound
reproduction is about creating illusions in our mind.
Loudspeaker Characteristics
Frequency response,
Phase response,
Powered, unpowered,
Coverage angle.
A set of priorities for loudspeaker design:
Low non-linear distortion, e.g. drivers that can move sufficient
amounts of air linearly in all parts of the frequency range but especially
at the low end.
Minimal excitation of room resonances, particularly at low frequencies.
This requires low frequency directional speakers such as dipoles.
Low amounts of stored energy in drivers, cabinet, air cavities and
filters for fast transient decays.
Smooth, extended frequency response from 20 Hz on up and without
exaggerated high frequencies, both on-axis and off-axis. Minimal roll-off in power response.
There are additional requirements, such as an acoustic center for the
speaker at ear height, vertical extension of the source, etc.
Hearing and Human Auditory System
The functional role of the central structures of the auditory system,
namely, the nervous pathway and the parts of the brain that govern
sound reception, is still not fully known. However, it is clear that
the transformations of the acoustic environment produced by the
peripheral structures are central to the sound processing functions
of the auditory system. The study of the nervous system’s cognitive
response to sound stimuli is known as psychoacoustics: it is partly
acoustics and partly psychology.
Human Ear
• partitioned into an outer,
middle and inner ear
• Outer ear (pinna, ear canal) gathers
and focuses sound onto the eardrum,
• Pinna: source elevation information,
front-to-rear discrimination
• Ear canal: introduces nonlinear
effects in the frequency domain. The
resonant frequency falls in the same
range as the peak in our sensitivity:
around 3kHz, and creates a maximum
boost of about 10 dB.
The middle ear contains a complicated linkage of
bones for the transformation of time variations in air
pressure to time variations in fluid pressure in the
cochlea. It acts as an impedance matching device.
The cochlea is a coiled, fluid-filled structure located in
the inner ear. It is a dual-purpose structure: it
converts mechanical vibrations into neuronal electrical
signals and it separates the frequency content of the
incoming sound into discrete frequency bands.
Signal processing in the cochlea
Fluid variations are conveyed to the nervous system
through sensory cells embedded in the basilar
membrane, a fibrous tissue which extends through the
middle of the cochlea.
The basilar membrane supports travelling waves along its extent.
The stationary envelope of the travelling wave reaches a maximum
at specific sites along the membrane which correspond to the
frequency components of the sound. Because the basilar
membrane is narrow and stiff at the base of the cochlea, and wide
and flexible at the tip of the cochlea, the site of maximum excursion
of the travelling wave is near the opening onto the middle ear for
high frequencies, and toward the cochlea tip for low frequencies.
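The place-frequency map described here is often approximated by Greenwood's empirical function; this is an addition, not given in the text, and the constants below are the standard fit for the human cochlea:

```python
def greenwood_cf(x):
    """Approximate characteristic frequency (Hz) at fractional
    position x along the human basilar membrane: x=0 at the apex
    (tip), x=1 at the base (near the middle ear).
    Constants are Greenwood's human fit (assumed here)."""
    return 165.4 * (10 ** (2.1 * x) - 1)

# Low frequencies map to the apex, high frequencies to the base:
greenwood_cf(0.1)   # ~103 Hz near the apex
greenwood_cf(1.0)   # ~20.7 kHz at the base
```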
Sensory receptors, known as hair cells, reside on the basilar
membrane. There are approximately 3,500 inner hair cells and
20,000 outer hair cells.
The inner hair cells detect the velocity of the passing fluid. The
motion induces an electrical potential change across the hair cell
membranes. Each inner hair cell responds optimally to stimulation at
a characteristic frequency. Stimulation with tones on either side of
the characteristic frequency results in diminished cellular response.
Inner hair cells also display a form of contrast enhancement. In the
auditory system, this effect is known as two-tone suppression. Cells
which are stimulated with a tone at the characteristic frequency and
with additional tones above and below the characteristic frequency
will not respond as strongly as they would to the
characteristic-frequency tone alone.
The outer hair cells are not sensory detectors. In fact, these cells
receive efferent connections from higher centers within the auditory
system. The outer hair cells form the terminal point in an
Automatic Gain Control (AGC) loop. These cells provide positive
feedback by pushing in the direction of motion. Inhibition from
higher centers decreases the activity of the outer hair cells,
resulting in natural signal damping. Thus, the outer hair cells act as
effectors of an AGC loop which performs early signal processing
resulting in an increased range of sensitivity.
Communication of hair cell response to nervous system is
achieved via the auditory nerve.
Studies of the manner in which the auditory nerve conveys
stimulus information suggest two interpretations.
The classical interpretation of the auditory nerve
representation is based on the premise that each ascending
nerve fibre innervates a specific portion of the basilar
membrane. Since each fibre innervates a single inner hair
cell, and each inner hair cell is sensitive to a particular
frequency in the stimulus, the fibers themselves are
"labeled" by frequency. Thus the fibers of the auditory
nerve fire optimally at a characteristics frequency, and these
fibers are organized to preserve the innervation pattern on
the basilar membrane. Therefore, the auditory nerve fibers
are said to be organized tonotopically. In this interpretation,
the auditory system conveys stimulus spectral content by
the average firing rate in each of the fibers of the auditory
nerve. This representation is called the rate-place code.
An alternative interpretation arises from the observation that the auditory
nerve fibers are capable of firing in synchrony with the stimulus. Fibers
which innervate the inner hair cells near the apex of the cochlea are
stimulated by low-frequency components. If the period of the stimulus is
longer than the duration of an action potential (1 ms), then the fibers are
phase-locked to the stimulus. These fibers are capable of representing the
temporal properties of the signal because their activity is directly correlated
with time-varying amplitude components of the signal. For stimuli whose
period is shorter than 1 ms, the auditory fibers are able to phase-lock to
multiples of the stimulus period. This suggests a representation of the
firing patterns of the auditory nerve which is referred to as a
"temporal-place" code because the "temporal" measure of synchronization
is computed only for those fibers with a specific characteristic frequency.
The pattern of firing activity across auditory nerve fibers reflects the power
in the individual spectral components of the stimulus. Thus the rate-place
code represents spectral peaks.
The rate-place code is highly sensitive to stimulus and ambient noise level
while the temporal-place code is relatively insensitive to stimulus amplitude.
Temporal-place representation preserves spectrum peaks even for rapid
spectral changes in sounds. The temporal-place representation is capable
of retaining detailed spectral information for large stimulus amplitudes.
The temporal-place representation appears to be more robust for periodic
stimuli. The temporal-place representation is derived from the
synchronization behavior of the auditory fibers, and this property is
observed primarily in low to moderate frequency regions.
Human Auditory Perception
Our auditory system is incredibly sensitive, allowing perception over
many orders of magnitude in both amplitude and frequency. We can
discriminate tiny changes in sound timbres and accurately determine
where in space a sound originates. We process sound information
through our hearing organs and our perception includes distortions:
alterations in the way sounds are transmitted and converted to
neuronal signals and in the way our brain interprets these inputs and
renders for us what we refer to as hearing.
Human ears can hear in the range of 16 Hz to about 20 kHz; this range
narrows with age. Hence, wavelengths vary from 21.3 m to 1.7 cm. The
intensity of sound can be measured in terms of Sound Pressure Level
(SPL) in decibels (dBs).
Intensity level = 10 log10(P / P0) dB
where P and P0 are values of acoustic power, and P0 is the power that
will deliver an intensity of sound at the threshold of hearing, which
is 10^-12 W/m^2 (watts per square meter).
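The decibel definitions can be sketched numerically; the 20 µPa pressure reference is the standard companion to the 10^-12 W/m^2 intensity reference and is assumed here:

```python
import math

I0 = 1e-12   # W/m^2, intensity at the threshold of hearing
P0 = 20e-6   # Pa, sound pressure at the threshold of hearing

def intensity_level_db(I):
    """Intensity level in dB relative to the threshold of hearing."""
    return 10 * math.log10(I / I0)

def spl_db(p):
    """Sound pressure level in dB SPL (pressure uses a factor of 20
    because intensity is proportional to pressure squared)."""
    return 20 * math.log10(p / P0)

print(intensity_level_db(1e-12))   # 0.0  (threshold of hearing)
print(round(spl_db(1.0)))          # 94   (1 Pa RMS is about 94 dB SPL)
```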
The ear is a remarkably complex device. Through it our perception of
sound is characterized by a high frequency resolution and excellent
frequency discrimination (tones differing by as little as 0.3% being
discriminable), a wide dynamic range of over 120 dB, a fine temporal
resolution (able to detect differences as small as 10 µs in the timing of
clicks presented to the two ears), and rapid temporal adaptation (faint
sounds being audible as little as 10 ms after the cessation of a
moderately loud sound).
Human Auditory Perception
Loudness Perception
The sensitivity of the human ear is not the same for tones of all
frequencies. It is most sensitive to frequencies in the range
1000 to 4000 Hz. Low- and high-frequency sounds require a
higher intensity sound to be just audible, and our concept of
“loudness” also varies with frequency. The diagram below shows
the hearing threshold curve as a function of frequency.
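One widely used closed-form approximation of the threshold-in-quiet curve is Terhardt's formula, common in perceptual audio coding; it is an addition here, not given in the text:

```python
import math

def threshold_quiet_db(f):
    """Terhardt's approximation of the threshold in quiet (dB SPL)
    for frequency f in Hz. Assumed formula, not from the text."""
    k = f / 1000.0
    return (3.64 * k ** -0.8
            - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)
            + 1e-3 * k ** 4)

# The curve dips (greatest sensitivity) near 3-4 kHz and rises
# steeply at low and very high frequencies:
threshold_quiet_db(100)    # a high threshold at low frequency
threshold_quiet_db(3300)   # the minimum, below 0 dB SPL
```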
Pitch Perception
Distinguishing audible tones by pitch enables us to order them
on a musical scale. Pitch perception may be explained by
two different theories. The place theory relates the pitch of a
tone to the place of maximum excitation on the basilar
membrane. The temporal theory relates pitch to the time
patterns of neural impulses. For pure tones, the ear can detect
a frequency difference of about 2 Hz for a 1000 Hz tone at
about 60-70 dB SPL. Accuracy is better for middle range
frequencies and for longer duration tones. For complex tones,
the ear is able to “hear out” some of the individual partials
within a complex tone.
Frequency Selectivity and Masking
The perception of frequency components in a sound can be
modeled by a series of overlapping bandpass filters centred
continuously at frequencies throughout the normal range of
hearing. The ability to discriminate between two simultaneously
presented sounds which contain frequencies that are very close
is limited to the width of one of these auditory bandpass
filters – the critical band. The concept of frequency selectivity
explains a very common perceived effect – that of masking.
Briefly, a sound is masked if it cannot be heard in the presence
of another sound. The hearing threshold is raised when
another sound is present.
More on critical bands and frequency masking
The human auditory system has a limited, frequency-dependent resolution.
A perceptually uniform measure of frequency can be expressed in
terms of the width of the critical bands. The critical bandwidth is
less than 100 Hz at the lowest audible frequencies, and more than
4 kHz at the high end. Altogether, the audio frequency range can be
partitioned into about 25 critical bands.
Bark Scale
A new unit for frequency, the bark (after Barkhausen), can be defined
according to the critical bands:
1 Bark = width of one critical band
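A common analytic approximation of the Hz-to-Bark mapping is Zwicker and Terhardt's formula; it is assumed here, not given in the text:

```python
import math

def hz_to_bark(f):
    """Zwicker & Terhardt's approximation of critical-band rate:
    frequency f in Hz mapped to Bark (assumed formula)."""
    return 13 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

hz_to_bark(1000)    # ~8.5 Bark
hz_to_bark(20000)   # ~24.6 Bark: about 25 bands span the audible range
```

The ~25 Bark covered by 20 kHz matches the partition into about 25 critical bands.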
An Experiment:
• Put a person in a quiet room. Raise the level of a 1 kHz tone
until it is just barely audible. Vary the frequency and plot the
threshold-of-hearing curve.
• Play a 1 kHz tone (the masking tone) at a fixed level of 60 dB.
Play a second tone (the test tone) at a nearby frequency (e.g.,
1.1 kHz), and raise its level until it is just audible. Vary the
frequency of the test tone and plot the threshold at which it
becomes audible.
• Repeat for various frequencies of the masking tone.
Temporal masking
If we hear a loud sound, then it
stops, it takes a little while until
we can hear a soft tone nearby.
Play 1 kHz masking tone at 60
dB, plus a test tone at 1.1 kHz
at 40 dB. Test tone can't be
heard (it's masked).
Stop masking tone, then stop
test tone after a short delay.
Adjust delay time to the
shortest time when test tone
can be heard (e.g., 5 ms).
Repeat with different levels of the test tone and plot the
results. The total masking effect combines both frequency
and temporal masking.