2. State of the art

This chapter reviews the state of the art of binaural systems and binaural individualization approaches. Section 2.1 briefly explains binaural techniques, section 2.2 explains some individualization approaches based on geometric modeling, and section 2.3 covers an individualization method based on anthropometry.

2.1. Fundamentals of binaural synthesis

Like wave field synthesis (WFS), Ambisonics or stereophony, binaural synthesis is a technology for sound spatialization. In other words, it aims to convey the illusion of sound sources located in space.

Binaural technology is based on the axiom that all cues required for spatial hearing are encoded in the sound pressure at the eardrums; thus, audio information convolved with those spatial cues recreates or synthesizes virtual auditory environments (VAEs). Figure 2.1 presents a typical setup of the sound wave reconstruction.

Natural cues of auditory localization result, as explained in the previous chapter, from the diffraction of the acoustic wave by the listener's morphology.

Rozenn Nicol, in a recent publication (Nicol 2010), mentions some advantages of binaural technology:

• Compactness, in the sense that the system is very portable and able to create virtual sounds anywhere around the listener (full 3D spatialization) with only two signals (i.e. over headphones).

• Low cost, in terms of signal capture, storage and delivery.

• Compatibility with loudspeaker reproduction, by using crosstalk cancellation.

• Good quality of sound spatialization in terms of immersion (Nicol 2010).


Figure 2.1.: Graphical example of the binaural synthesis concept. Top: perception of a real sound source. Bottom: reconstruction of the sound pressure at the eardrums through binaural synthesis.


Figure 2.2.: Binaural data set acquisition using the head and torso simulator FABIAN (Lindau 2006).

2.1.1. Encoding binaural information

The process of gathering cues for spatial localization, known as binaural encoding, is based on the synthesis of the pertinent temporal (ITD) and spectral characteristics (ILD and spectral cues (SC)). There are several ways to acquire binaural signals; the most straightforward consists of recording an acoustic scene with the aid of a HATS (see fig. 2.2) or a human subject wearing microphones at the entrance of the ear canals. Another way to create an artificial acoustic scene is to use binaural synthesis. This technique consists in convolving anechoic audio with binaural room transfer functions (BRTFs) or binaural room impulse responses (BRIRs) that describe the acoustic path between the sound source and the listener's ears.

Therefore, the acquisition of binaural information consists in gathering data sets of HRTFs or their temporal counterparts, head-related impulse responses (HRIRs).
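The convolution step itself is straightforward. The following minimal sketch renders a static binaural signal from an anechoic mono signal and one impulse-response pair taken from such a data set; the variable names are illustrative, not a specific API.

```python
# Minimal sketch of static binaural synthesis: convolve an anechoic mono
# signal with one left/right impulse-response pair from a data set.
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(mono, ir_left, ir_right):
    left = fftconvolve(mono, ir_left)
    right = fftconvolve(mono, ir_right)
    out = np.stack([left, right], axis=-1)      # shape: (samples, 2)
    peak = np.max(np.abs(out))
    return out / peak if peak > 0 else out      # simple peak normalization
```

A dynamic renderer would additionally exchange the impulse-response pair according to the listener's head orientation, as discussed in section 2.1.4.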

2.1.2. Decoding binaural information

Reproduction of the encoded sound pressure at the eardrums is not a trivial matter, since headphones/earphones are not acoustically transparent and the recording and reproduction points are not identical; thus, headphone calibration is required (McAnally and Russell 2002). Frequency calibration accounts for:

• the frequency response of the acoustic emitter of the headphones,

• the acoustic coupling between the headphone and the listener's external ear.


Figure 2.3.: HpTF dependency on headphone positioning and the individual morphology. 10 successive headphone positions (Sennheiser D600) at the left ear of a representative subject are shown. The curves are shifted by 5 dB for legibility. Note the differences for frequencies above 5 kHz. From Nicol (2010).

Since the headphone transfer function (HpTF) depends not only on the individual pinna morphology but also on the headphone positioning (Møller et al. 1995), a common compromise to achieve calibration is to perform several measurements and compute their average (Nicol 2010). Figure 2.3 shows an example of 10 HpTF measurements at 10 consecutive positions. Note the differences above 5 kHz.
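As an illustration, this averaging compromise can be sketched as follows. The repositioned HpTF impulse responses are assumed to be available as rows of an array (names are illustrative); the energetic mean of their magnitude spectra would then serve as the target for an inverse (calibration) filter design, which is omitted here.

```python
# Sketch of the averaging compromise for HpTF calibration: reduce several
# repositioned measurements (rows of `hptf_irs`, equal length) to one
# average magnitude response.  Illustrative only, not a specific toolbox.
import numpy as np

def average_hptf_magnitude(hptf_irs, n_fft=4096):
    spectra = np.fft.rfft(np.asarray(hptf_irs), n=n_fft, axis=1)
    mean_power = np.mean(np.abs(spectra) ** 2, axis=0)   # energetic mean
    return np.sqrt(mean_power)                           # average |HpTF|
```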

2.1.3. HRTF and localization cues

As already mentioned in Chapter 1, HRTFs contain the cues for spatial localization which are the essence of binaural synthesis. In this section, further characteristics of these localization cues are discussed in more detail.


Figure 2.4.: Path length differences causing the interaural time difference.

2.1.3.1. Low and high frequency ITD

Kuhn (1977) analyzed the frequency dependency of the ITD based on the diffraction of a plane wave by a spherical head model. In his work the interaural phase difference (IPD) is computed as:

\mathrm{IPD} = 3ka \cdot \sin(\theta) \qquad (2.1)

where k is the wave number, a is the head radius and θ the azimuth incidence angle in radians. The ITD is then:

\mathrm{ITD} = \frac{\mathrm{IPD}}{2\pi f} \qquad (2.2)

At low frequencies, assuming (ka)^2 ≪ 1, the ITD becomes:

\mathrm{ITD}_{\mathrm{lowF}} = \frac{3a \cdot \sin(\theta)}{c} \qquad (2.3)

where c is the speed of sound.

At high frequencies the diffracted wave can be considered as a wave propagating at a speed close to the speed of sound c. Thus, the ITD can be derived from the path difference between the symmetrically disposed ears, including the path circumventing the spherical head. The ITD becomes:

\mathrm{ITD}_{\mathrm{highF}} = \frac{a \cdot (\sin\theta + \theta)}{c} \qquad (2.4)

This equation exactly matches Woodworth's formula⁷, which is therefore identified as a high frequency model of the ITD (Algazi et al. 2001b).

For angles close to the median plane (θ ≈ sin(θ)), ITD_lowF is 1.5 times ITD_highF, i.e. 50 % larger. Figure 2.5 shows the ITD(f) of two human subjects, where the above-mentioned relationship can be seen.

7 Woodworth's formula will be addressed in more detail in section 2.2.
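For illustration, the two spherical-head approximations (eqs. 2.3 and 2.4) can be evaluated directly; the head radius of 8.75 cm and c = 343 m/s below are assumed example values, not taken from the text.

```python
# Low- and high-frequency spherical-head ITD (eqs. 2.3 and 2.4).
import numpy as np

def itd_low_freq(theta_rad, a=0.0875, c=343.0):
    return 3.0 * a * np.sin(theta_rad) / c

def itd_high_freq(theta_rad, a=0.0875, c=343.0):
    return a * (np.sin(theta_rad) + theta_rad) / c

theta = np.radians(10.0)
print(itd_low_freq(theta) / itd_high_freq(theta))   # approx. 1.5 near the median plane
```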


Figure 2.5.: ITD as a function of frequency at increasing azimuths (17° to 90°), computed using HRTFs of selected subjects. 6 sound source locations in the azimuth plane were considered. Note the low and high frequency ITD. From Nicol (2010).

2.1.3.2. ITD estimation in binaural data sets

In a previous work (Estrella 2010), under the scope of ITD individualization, several ITD estimation methods were analyzed and compared with each other. In the following, a brief description of those methods and the results of the analysis is given.

Maximum of the interaural cross-correlation (MIACC)

This method consists of cross-correlating the impulse responses of the left and right ears with each other and measuring the time up to the maximum of the correlation. According to Mills (1958) the threshold for the detection of ITD changes is approx. 10 µs under optimal conditions. At a sample rate of 44100 Hz the time difference between two adjacent samples is already 22 µs; therefore, for appropriate accuracy, the HRIRs should first be up-sampled.

This method tends to give erratic ITD values at lateral positions, most probably due to the lower SNR of the contralateral IR and the lack of coherence between the ipsilateral and contralateral IRs at those angles (Busson et al. 2005).
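A minimal sketch of the MIACC estimator, assuming a pair of HRIRs and their sample rate (the 10x up-sampling factor is an example setting):

```python
# MIACC sketch: up-sample both HRIRs, cross-correlate them and read the ITD
# off the lag of the correlation maximum (positive when the left IR lags).
import numpy as np
from scipy.signal import resample_poly, correlate

def itd_miacc(ir_left, ir_right, fs, up=10):
    l = resample_poly(ir_left, up, 1)
    r = resample_poly(ir_right, up, 1)
    xcorr = correlate(l, r, mode='full')
    lag = np.argmax(np.abs(xcorr)) - (len(r) - 1)   # lag in up-sampled samples
    return lag / (fs * up)                          # ITD in seconds
```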

Cross-correlation with minimum phase impulse responses

Nam et al. (2008) showed that for the vast majority of HRIRs the correlation between an impulse response and its minimum phase representation is over 0.9; thus, finding the times up to the maximum of this type of cross-correlation for the left and right HRIRs and subtracting them from each other gives the ITD.


The method also presents discontinuities at lateral angles (around ±50° to 130°). It also requires a lot of processing time because the extraction of the minimum phase impulse responses and the cross-correlation are both realized with up-sampled IRs. BRIRs of large rooms, which already consist of long vectors, become problematic in this sense.
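One possible realization of the minimum-phase counterpart uses the real-cepstrum (homomorphic) method; the sketch below also computes the normalized correlation with the original IR, the quantity Nam et al. report to exceed 0.9. This is an illustrative implementation under the assumption of a single HRIR vector, not the authors' original code.

```python
# Minimum-phase counterpart of an HRIR via cepstral folding, plus the
# normalized maximum cross-correlation with the original IR.
import numpy as np

def minimum_phase(ir, n_fft=None):
    n = int(n_fft or len(ir))
    log_mag = np.log(np.maximum(np.abs(np.fft.fft(ir, n)), 1e-12))
    cep = np.fft.ifft(log_mag).real              # real cepstrum of |H|
    w = np.zeros(n)                              # folding window
    w[0] = 1.0
    w[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        w[n // 2] = 1.0
    return np.fft.ifft(np.exp(np.fft.fft(w * cep))).real[:len(ir)]

def max_normalized_corr(ir):
    ir_mp = minimum_phase(ir)
    xcorr = np.correlate(ir, ir_mp, mode='full')
    return np.max(np.abs(xcorr)) / (np.linalg.norm(ir) * np.linalg.norm(ir_mp))
```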

Onset detection

This method, also known as edge detection, measures the time up to a given threshold in the left and right onsets of the binaural IRs (e.g. 10 % of the peak in Minnaar et al. (2000)). The ITD equals the difference between the times found. For appropriate accuracy the IRs should be up-sampled.

Visual inspection of the data set helps in finding an appropriate threshold. The ITD in BRIRs is reliably detected when using thresholds of -20 to -40 dB relative to the maximum peak. This estimation method is fast and robust, but it depends on the chosen threshold.
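A sketch of the threshold detector (up-sampling factor and threshold are example settings):

```python
# Onset/threshold detection sketch: up-sample each IR, locate the first
# sample exceeding a threshold relative to that IR's peak, and take the
# difference of the two onset times as the ITD.
import numpy as np
from scipy.signal import resample_poly

def itd_onset(ir_left, ir_right, fs, up=10, threshold_db=-20.0):
    lin = 10.0 ** (threshold_db / 20.0)

    def onset(ir):
        x = np.abs(resample_poly(ir, up, 1))
        return np.argmax(x >= lin * np.max(x))   # first index above threshold

    return (onset(ir_left) - onset(ir_right)) / (fs * up)
```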

Interaural group delay difference at 0 Hz (IGD_0)

An HRTF can be decomposed into a minimum-phase and an excess phase component⁸. In this approach the ITD is the interaural group delay difference of the excess phase components evaluated at 0 Hz.

However, binaural data sets are recorded using real electro-acoustic transducers (loudspeakers and microphones) as well as AD converters with DC blocking; thus, they do not provide any useful information at 0 Hz (DC).

One approach to overcome this problem is to extrapolate, using as reference the data of a frequency range below 1.5 kHz, where according to Minnaar et al. (2000) the group delay should be almost constant.

This method has the disadvantage of being highly dependent on the chosen frequency range, and it requires a lot of computation time with longer impulse responses.

Phase delay fitting

This method was first proposed in Jot et al. (1995). It assumes that the excess phase of an HRTF is a linear function of frequency up to 8-10 kHz. Since the excess phase component of the HRTF can be replaced by a pure delay, this delay can be calculated by fitting a linear curve to the excess-phase response between 1 kHz and 5 kHz for the left and right ears and computing the difference.

Huopaniemi and Smith (1999) proposed another frequency range, 500 Hz to 2 kHz, whereas Minnaar et al. (2000) state that the phase can only be linear as a function of frequency below 1.5 kHz.

8 Decomposition of HRTFs is treated in section 3.2.
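The phase-delay-fitting idea can be sketched as follows; the excess phase is obtained here by subtracting a cepstrum-based minimum-phase estimate from the unwrapped HRTF phase, the 1-5 kHz fitting range follows Jot et al., and all other parameters are assumptions.

```python
# Phase-delay-fitting sketch: fit a line to the excess phase between f1 and
# f2; the slope corresponds to a pure delay per ear, and the ITD is the
# left/right delay difference.
import numpy as np

def excess_phase_delay(ir, fs, f1=1000.0, f2=5000.0, n_fft=4096):
    spec = np.fft.fft(ir, n_fft)
    cep = np.fft.ifft(np.log(np.maximum(np.abs(spec), 1e-12))).real
    w = np.zeros(n_fft)                          # cepstral folding window
    w[0] = 1.0
    w[1:n_fft // 2] = 2.0
    w[n_fft // 2] = 1.0
    spec_min = np.exp(np.fft.fft(w * cep))       # minimum-phase spectrum
    half = n_fft // 2 + 1
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    excess = np.unwrap(np.angle(spec[:half])) - np.unwrap(np.angle(spec_min[:half]))
    band = (freqs >= f1) & (freqs <= f2)
    slope = np.polyfit(freqs[band], excess[band], 1)[0]
    return -slope / (2.0 * np.pi)                # pure delay in seconds

def itd_phase_fit(ir_left, ir_right, fs):
    return excess_phase_delay(ir_left, fs) - excess_phase_delay(ir_right, fs)
```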


Figure 2.6.: ITD extracted using the onset detection method with 10x up-sampling, threshold -3 dB. Data set: FABIAN's HRIRs (elevation: 0°, azimuth: -180° to +180°, resolution: 1°).

The frequency dependency of this method means that the obtained ITD varies according to the frequency range evaluated. In the work of Algazi et al. (2001a) it is also mentioned that phase-related methods are problematic because reflections and resonances of the torso and pinnae cause unpredictable phase responses in the HRTFs.

Out of the methods analyzed in Estrella (2010), the threshold detection method seems to be the most appropriate: it delivers ITD estimations that are continuous functions of the angle for most kinds of data sets, it is computationally fast, and it delivers values which, according to Busson et al. (2005), are most similar to the perceptually correct ones. Note that the high frequency ITD (obtained with the threshold method) has been found to be perceptually satisfying (Constan and Hartmann 2003; Nicol 2010).

Figure 2.6 shows an ITD estimation example using the onset detection method. The performance of the threshold method compared to the perceptual ITD can be seen in figure 2.7.

2.1.3.3. ILD as localization cue

The ILD, like the other cues for spatial localization, is closely related to the physical structure of the human subject or HATS involved in the data set acquisition (see fig. 2.8).


Figure 2.7.: Subjective ITD vs. ITD extracted with the edge detection method. Means of the subjects' answers and standard deviations are plotted with continuous lines. The dotted line represents the ITD estimation method. Only the horizontal plane is considered. From Busson et al. (2005).

It is defined as the difference of the magnitude spectra of the left and right HRTFs and can be computed as:

\mathrm{ILD} = 10 \cdot \log_{10} \left( \frac{\int_{f_1}^{f_2} |H_L(f)|^2 \, df}{\int_{f_1}^{f_2} |H_R(f)|^2 \, df} \right) \qquad (2.5)

where |H_{L,R}(f)| are the HRTF magnitude spectra and f_1, f_2 define the integration band. The boundaries usually chosen are 1 kHz and 5 kHz, since that is the range where the ILD plays a primary role as a localization cue (Nicol 2010).
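Numerically, eq. 2.5 reduces to an energy ratio over the chosen band; a sketch assuming one HRIR pair and its sample rate (FFT length and band limits are example values):

```python
# Broadband ILD after eq. 2.5: ratio of left and right energies integrated
# between f1 and f2, expressed in dB.
import numpy as np

def ild(ir_left, ir_right, fs, f1=1000.0, f2=5000.0, n_fft=4096):
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    band = (freqs >= f1) & (freqs <= f2)
    e_left = np.sum(np.abs(np.fft.rfft(ir_left, n_fft))[band] ** 2)
    e_right = np.sum(np.abs(np.fft.rfft(ir_right, n_fft))[band] ** 2)
    return 10.0 * np.log10(e_left / e_right)
```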

2.1.3.4. SC as localization cue

Spectral cues are defined as salient features of the HRTF magnitude spectrum which can be interpreted by the auditory system to localize sounds. It has been demonstrated that SC primarily convey information regarding sound elevation (Langendijk and Bronkhorst 2002; Møller and Hoffmann 2006). SC are mainly located within the 4-16 kHz band.


Figure 2.8.: Acoustic shadowing causing the interaural level difference.

2.1.4. Dynamic binaural auralization

Rendering of binaural signals that takes head movements into account is referred to as dynamic binaural synthesis. Here, exchanging BRIRs in real time according to the listener's head movements leads to a stable sound source localization. Dynamic binaural systems require a head-tracking system whose information is used to adapt the binaural filters, modifying the virtual direction of the sound source in order to keep the sound source perceived at a fixed location in the listener's frame of reference (Nicol 2010; Sandvad 1996).

The advantages of dynamic binaural systems are:

• Reduction of front-back confusions compared to static systems (Wenzel 1999).

• According to Nicol (2010), subjects judge the spatial attributes of sound scenes in dynamic binaural systems as having increased spatial precision, externalization, envelopment and realism compared to static binaural systems.

Since the binaural filters need to be exchanged interactively, two aspects have to be considered in dynamic binaural systems:

• The latency between head movements and the system's reaction, which should be as low as possible in order to keep the overall system latency under 60 ms⁹.

• Audible artifacts (discontinuities) caused by the switching of the binaural filters; a common practice to avoid them is to cross-fade between the current and the previous output signals (a minimal sketch follows the footnote below).

9 According to Brungart et al. (2006), latency values lower than 60 ms are likely to be adequate for most virtual audio applications, and delays of less than 30 ms are difficult to detect even in very demanding virtual auditory environments.
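A minimal sketch of the cross-fading step mentioned above, assuming the same audio block has been rendered with both the previous and the new filter pair (a linear ramp is used here; raised-cosine ramps are equally common):

```python
# Cross-fade sketch for binaural filter exchange: fade out the block rendered
# with the old BRIR pair while fading in the block rendered with the new one.
import numpy as np

def crossfade(block_old, block_new):
    """Linearly cross-fade two equally long (samples, 2) binaural blocks."""
    n = block_old.shape[0]
    fade = np.linspace(0.0, 1.0, n)[:, None]    # ramp 0 -> 1 over the block
    return (1.0 - fade) * block_old + fade * block_new
```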


Figure 2.9.: Woodworth and Schlosberg's ITD computation method based on a spherical head model. From Kuhn (1977).

2.2. Individualization using geometrical models

In section 1.1 the problem of non-individualized binaural reproduction was already addressed; in this section, approaches to binaural individualization are reviewed.

The need for individualized spatial cues has led many investigators to try to synthesize head-related transfer functions (HRTFs) by relating their characteristics to anthropometric parameters. Woodworth et al. (1972) developed a formula for predicting the high frequency ITD based on just one anthropometric parameter, the head radius (see fig. 2.9). This formula (eq. 2.6) takes account of the diffraction of a plane wave around the sphere¹⁰:

\mathrm{ITD} = \frac{a}{c} \left( \sin\theta + \theta \right) \qquad (2.6)

with:
a = head radius
c = speed of sound
θ = azimuth angle in radians, −π/2 < θ < π/2

Larcher and Jot (1999) and Savioja et al. (1999) extended formula 2.6 to include the elevation dependency of the ITD and to cover the whole horizontal and frontal planes.

10 The Woodworth-Schlosberg formula (eq. 2.6) considers only the horizontal plane.


Figure 2.10.: Anthropometric measures used to find the optimal head radius in Algazi et al. (2001b).

The mentioned geometrical models use the sound source direction¹¹ and the head radius as individualization parameters to synthesize the ITD. In order to apply them, a method for computing an equivalent head radius from head dimensions is required.

2.3. Anthropometric aided individualization

Algazi developed an empirical formula that provides an optimal head radius for use with Woodworth's ITD model (Algazi et al. 2001b). The equation is based on three anthropometric measures: head width, head height and head depth (X_1, X_2 and X_3, respectively, in eq. 2.7).

a_{\mathrm{opt}} = W_1 X_1 + W_2 X_2 + W_3 X_3 + b \;\; [\mathrm{cm}] \qquad (2.7)

with:
X_1 = head width / 2
X_2 = head height / 2
X_3 = head depth / 2

By means of linear regression over the above-mentioned head dimensions¹², an empirical formula for predicting the optimal head radius was obtained:

a_{\mathrm{opt}} = 0.51 X_1 + 0.019 X_2 + 0.18 X_3 + 3.2 \;\; [\mathrm{cm}] \qquad (2.8)

11 In data set based binaural auralization the position of the sound sources is mostly unknown.

12 HRTF recordings of 25 subjects, male and female, Caucasian and Asian, were used in this method. Least squares fitting between the measured ITD and the ITD produced by Woodworth's formula was applied.
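Putting eqs. 2.6 and 2.8 together, the individualized high-frequency ITD can be predicted from three head measures; in the sketch below, the head dimensions are illustrative values, not taken from the study.

```python
# Optimal head radius after eq. 2.8 (inputs are half head width, half head
# height and half head depth in cm) inserted into Woodworth's formula (2.6).
import numpy as np

def optimal_head_radius_cm(x1, x2, x3):
    return 0.51 * x1 + 0.019 * x2 + 0.18 * x3 + 3.2

def woodworth_itd(theta_rad, a_cm, c=343.0):
    return (a_cm / 100.0) / c * (np.sin(theta_rad) + theta_rad)

# illustrative head dimensions in cm: width 15.5, height 21.0, depth 19.5
a_opt = optimal_head_radius_cm(15.5 / 2, 21.0 / 2, 19.5 / 2)
print(a_opt, woodworth_itd(np.radians(45.0), a_opt))   # radius [cm], ITD [s]
```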


Algazi’s anthropometric method for finding an optimal head-radius represent an enhancement on the applicability of the geometric models and at the same time an interesting approach for relating human-head’s dimensions to the individualized ITD.

However, the position of the sound sources has to be known in order to apply these geometric individualization models. In binaural data set based auralization the position of the sound sources is rarely known; thus, the above-mentioned methods are not suitable for this purpose. Nevertheless, the procedure of relating anthropometric head measures to the interaural time difference serves as inspiration for a new individualization approach.
