http://www.acoustics.hut.fi/publications/reports/sound_synth_report.pdf

Helsinki University of Technology
Department of Electrical and Communications Engineering
Laboratory of Acoustics and Audio Signal Processing
Evaluation of Modern Sound
Synthesis Methods
Tero Tolonen, Vesa Välimäki, and Matti Karjalainen
Report 48
March 1998
ISBN 951-22-4012-2
ISSN 1239-1867
Espoo 1998
Table of Contents

1 Introduction
2 Abstract Algorithms, Processed Recordings, and Sampling
  2.1 FM Synthesis
    2.1.1 FM Synthesis Method
    2.1.2 Feedback FM
    2.1.3 Other Developments of the Simple FM
  2.2 Waveshaping Synthesis
  2.3 Karplus-Strong Algorithm
  2.4 Sampling Synthesis
    2.4.1 Looping
    2.4.2 Pitch Shifting
    2.4.3 Data Reduction
  2.5 Multiple Wavetable Synthesis Methods
  2.6 Granular Synthesis
    2.6.1 Asynchronous Granular Synthesis
    2.6.2 Pitch Synchronous Granular Synthesis
    2.6.3 Other Granular Synthesis Methods
3 Spectral Models
  3.1 Additive Synthesis
    3.1.1 Reduction of Control Data in Additive Synthesis by Line-Segment Approximation
  3.2 The Phase Vocoder
  3.3 Source-Filter Synthesis
  3.4 McAulay-Quatieri Algorithm
    3.4.1 Time-Domain Windowing
    3.4.2 Computation of the STFT
    3.4.3 Detection of the Peaks in the STFT
    3.4.4 Removal of Components below Noise Threshold Level
    3.4.5 Peak Continuation
    3.4.6 Peak Value Interpolation and Normalization
    3.4.7 Additive Synthesis of Sinusoidal Components
  3.5 Spectral Modeling Synthesis
    3.5.1 SMS Analysis
    3.5.2 SMS Synthesis
  3.6 Transient Modeling Synthesis
    3.6.1 Transient Modeling with Unitary Transforms
    3.6.2 TMS System
  3.7 Inverse FFT (FFT^-1) Synthesis
  3.8 Formant Synthesis
    3.8.1 Formant Wave-Function Synthesis and CHANT
    3.8.2 VOSIM
4 Physical Models
  4.1 Numerical Solving of the Wave Equation
    4.1.1 Damped Stiff String
    4.1.2 Difference Equation for the Damped Stiff String
    4.1.3 The Initial Conditions for the Plucked and Struck String
    4.1.4 Boundary Conditions for Strings in Musical Instruments
    4.1.5 Vibrating Bars
    4.1.6 Results: Comparison with Real Instrument Sounds
  4.2 Modal Synthesis
    4.2.1 Modal Data of a Substructure
    4.2.2 Synthesis Using Modal Data
    4.2.3 Application to an Acoustic System
  4.3 Mass-Spring Networks: the CORDIS System
    4.3.1 Elements of the CORDIS System
  4.4 Comparison of the Methods Using Numerical Acoustics
5 Digital Waveguides and Extended Karplus-Strong Models
  5.1 Digital Waveguides
    5.1.1 Waveguide for Lossless Medium
    5.1.2 Waveguide with Dispersion and Frequency-Dependent Damping
    5.1.3 Applications of Waveguides
  5.2 Waveguide Meshes
    5.2.1 Scattering Junction Connecting N Waveguides
    5.2.2 Two-Dimensional Waveguide Mesh
    5.2.3 Analysis of Dispersion Error
  5.3 Single Delay Loop Models
    5.3.1 Waveguide Formulation of a Vibrating String
    5.3.2 Single Delay Loop Formulation of the Acoustic Guitar
  5.4 Single Delay Loop Model with Commuted Body Response
    5.4.1 Commuted Model of Excitation and Body
    5.4.2 General Plucked String Instrument Model
    5.4.3 Analysis of the Model Parameters
6 Evaluation Scheme
  6.1 Usability of the Parameters
  6.2 Quality and Diversity of Produced Sounds
  6.3 Implementation Issues
7 Evaluation of Several Sound Synthesis Methods
  7.1 Evaluation of Abstract Algorithms
    7.1.1 FM Synthesis
    7.1.2 Waveshaping Synthesis
    7.1.3 Karplus-Strong Synthesis
  7.2 Evaluation of Sampling and Processed Recordings
    7.2.1 Sampling
    7.2.2 Multiple Wavetable Synthesis
    7.2.3 Granular Synthesis
  7.3 Evaluation of Spectral Models
    7.3.1 Basic Additive Synthesis
    7.3.2 FFT-Based Phase Vocoder
    7.3.3 McAulay-Quatieri Algorithm
    7.3.4 Source-Filter Synthesis
    7.3.5 Spectral Modeling Synthesis
    7.3.6 Transient Modeling Synthesis
    7.3.7 FFT^-1
    7.3.8 Formant Wave-Function Synthesis
    7.3.9 VOSIM
  7.4 Evaluation of Physical Models
    7.4.1 Finite Difference Methods
    7.4.2 Modal Synthesis
    7.4.3 CORDIS
    7.4.4 Digital Waveguide Synthesis
    7.4.5 Waveguide Meshes
    7.4.6 Commuted Waveguide Synthesis
  7.5 Results of Evaluation
8 Summary and Conclusions
Bibliography
List of Figures

2.1 Three FM systems.
2.2 Frequency-domain presentation of FM synthesis.
2.3 A comparison of three different FM techniques.
2.4 Waveshaping with four different shaping functions.
2.5 The Karplus-Strong algorithm.
2.6 Frequency response of the Karplus-Strong model.
3.1 Time-varying additive synthesis, after (Roads, 1995).
3.2 The additive analysis technique.
3.3 The line-segment approximation in additive synthesis.
3.4 The phase vocoder.
3.5 Source-filter synthesis.
3.6 An example of zero-phase windowing.
3.7 An example of the peak continuation algorithm.
3.8 An example of peak picking in a magnitude spectrum.
3.9 A detail of phase spectra in an STFT frame.
3.10 Additive synthesis of the sinusoidal signal components.
3.11 The analysis part of the SMS technique, after (Serra and Smith, 1990).
3.12 The synthesis part of the SMS technique, after (Serra and Smith, 1990).
3.13 An example of TMS. An impulsive signal (top) is analyzed.
3.14 An example of TMS. A slowly-varying signal (top) is analyzed. A DCT (middle) is computed, and a DFT (magnitude at bottom) is performed on the DCT representation.
3.15 A block diagram of the transient modeling part of the TMS system, after (Verma et al., 1997).
3.16 A typical FOF.
3.17 The VOSIM time function. N = 11, b = 0.9, A = 1, M = 0, and T = 10 ms.
4.1 Illustration of the recurrence equation of the finite difference method.
4.2 Models for boundary conditions of string instruments.
4.3 A modal scheme for the guitar.
4.4 A model of a string according to the CORDIS system.
5.1 d'Alembert's solution of the wave equation.
5.2 The one-dimensional digital waveguide, after (Smith, 1992).
5.3 Lossy and dispersive digital waveguides.
5.4 A scattering junction of N waveguides.
5.5 Block diagram of a 2D waveguide mesh, after (Van Duyne and Smith, 1993a).
5.6 Dispersion in digital waveguides.
5.7 Dual delay-line waveguide model for a plucked string with a force output at the bridge.
5.8 A block diagram of transfer function components as a model of the plucked string with force output at the bridge.
5.9 The principle of commuted waveguide synthesis.
5.10 An extended string model with dual-polarization vibration and sympathetic coupling.
5.11 An example of the effect of mistuning the polarization models.
5.12 An example of sympathetic coupling.
List of Tables

6.1 Criteria for the parameters of synthesis methods with ratings used in the evaluation scheme.
6.2 Criteria for the quality and diversity of synthesis methods with ratings used in the evaluation scheme.
6.3 Criteria for the implementation issues of synthesis methods with ratings used in the evaluation scheme.
7.1 Tabulated evaluation of the sound synthesis methods presented in this document.
Abstract
In this report, several digital sound synthesis methods are described and evaluated.
The methods are divided into four groups according to a taxonomy proposed by
Smith. Representative examples of sound synthesis techniques in each group are
chosen. The evaluation criteria are based on those proposed by Jaffe. The selected
synthesis methods are rated with a discussion concerning each criterion.
Keywords: sound synthesis, digital signal processing, musical acoustics, computer
music
Preface
The main part of this work has been carried out as part of phase I of the TEMA
(Testbed for Music and Acoustics) project, which has been funded within the European
Union's Open Long Term Research ESPRIT program. The duration of phase I was 9
months during the year 1997. This report discusses digital sound synthesis methods.
As Deliverable 1.1a of the TEMA project phase I, it aims at giving guidelines for the
second phase of the project for the development of a sound synthesis and processing
environment.
The partners of the TEMA consortium in the phase I of the project were Helsinki
University of Technology, Staatliches Institut für Musikforschung (Berlin, Germany),
the University of York (United Kingdom), and SRF/PACT (United Kingdom). In
Helsinki University of Technology (HUT), two laboratories were involved in the
TEMA project: the Laboratory of Acoustics and Audio Signal Processing and the
Telecommunications and Multimedia Laboratory.
The authors would like to thank Professor Tapio "Tassu" Takala for his support
and guidance as the TEMA project leader at HUT. We are also grateful to
Dr. Ioannis Zannos, who acted as the coordinator of the TEMA project, and to the
representatives of the other partners for smooth collaboration and fruitful discussions.
This report summarizes and extends the contribution in the TEMA project by
the HUT Laboratory of Acoustics and Audio Signal Processing. We would like to
acknowledge the insightful comments on our manuscript given by Professor Julius
O. Smith (CCRMA, Stanford University, California, USA), Dr. Davide Rocchesso
and Professor Giovanni de Poli (both at CSC-DEI, University of Padova, Padova,
Italy).
Espoo, March 26, 1998
Tero Tolonen, Vesa Välimäki, and Matti Karjalainen
1. Introduction
Digital sound synthesis methods are numerical algorithms that aim at producing
musically interesting and preferably realistic sounds in real time. In musical applications, the input for sound synthesis consists of control events only. Numerous
different approaches are available.
The purpose of this document is not to cover the details of every method.
Rather, we attempt to give an overview of several sound synthesis methods. The
second aim is to establish the tasks or synthesis problems for which a given method
is best suited. This is done by evaluating the synthesis algorithms. We would like
to emphasize that no attempt has been made to put the algorithms in any precise
order as this, in our opinion, would be impossible.
The synthesis algorithms were chosen to be representative examples in each class.
With each algorithm, an attempt was made to give an overview of the method and
to refer the interested reader to the literature.
The approach followed for evaluation is based on a taxonomy by Smith (1991).
Smith divides digital sound synthesis methods into four groups: abstract algorithms,
processed recordings, spectral models, and physical models. This document follows
Smith's taxonomy in a slightly modified form and discusses representative methods
from each category.
Each method was categorized into one of the following groups: abstract algorithms, sampling and processed recordings, spectral models, and physical models.
More emphasis is given to spectral and physical modeling since these seem to provide more attractive future prospects in high-quality sound synthesis. In these last
categories there is more activity in research and, in general, their future potential
looks especially promising.
This document is organized as follows. Selected synthesis methods are presented
in Chapters 2 to 5. After that, evaluation criteria are developed based on those
proposed by Jaffe (1995). An additional criterion is included concerning the suitability of a method for distributed and parallel processing. The evaluation results
are collected in a table in which the rating of each method can be compared. The
document is concluded with a discussion of the features desirable in an environment
in which the methods discussed can be implemented.
2. Abstract Algorithms, Processed
Recordings, and Sampling
The first experiments that can be interpreted as ancestors of computer music were
done in the 1920s by composers such as Milhaud, Hindemith, and Toch, who experimented
with variable-speed phonographs in concert (Roads, 1995). In 1950 Pierre Schaeffer
founded the Studio de Musique Concrète in Paris (Roads, 1995). In musique concrète
the composer works with sound elements obtained from recordings of real sounds.
The methods presented in this chapter are based either on abstract algorithms or
on recordings of real sounds. According to Smith (1991), these methods may become
less common in commercial synthesizers as more powerful and expressive techniques
arise. However, they still serve as a useful background for the more elaborate sound
synthesis methods. Particularly, they may still prove to be superior in some specic
sound synthesis problems, e.g., when simplicity is of highest importance, and we are
likely to see them in use for decades.
The chapter starts with three methods based on abstract algorithms: FM synthesis, waveshaping synthesis, and the Karplus-Strong algorithm. Then, three methods
utilizing recordings are discussed. These are sampling, multiple wavetable synthesis,
and granular synthesis.
2.1 FM Synthesis
FM (frequency modulation) synthesis is a fundamental digital sound synthesis technique employing a nonlinear oscillating function. FM synthesis in a wide sense consists of a family of methods, each of which utilizes the principle originally introduced
by Chowning (1973).
The theory of FM was well established by the mid-twentieth century for radio
frequencies. The use of FM in audio frequencies for the purposes of sound synthesis
was not studied until the late 1960s. John Chowning at Stanford University was the first
to study FM synthesis systematically. The time-variant structure of natural sounds
is relatively hard to achieve using linear techniques, such as additive synthesis (see
section 3.1). Chowning observed that complex audio spectra can be achieved with
just two sinusoidal oscillators. Furthermore, the synthesized complex spectra can
be varied in time.
2.1.1 FM Synthesis Method
In the most basic form of FM, two sinusoidal oscillators, namely, the carrier and
the modulator, are connected in such a way that the frequency of the carrier is
modulated with the modulating waveform. A simple FM instrument is pictured in
Figure 2.1 (a). The output signal y(n) of the instrument can be expressed as
    y(n) = A(n) sin[2π f_c n + I sin(2π f_m n)]                    (2.1)

where A(n) is the amplitude, f_c is the carrier frequency, I is the modulation index,
and f_m is the modulating frequency. The modulation index I represents the ratio
of the peak deviation of modulation to the modulating frequency. It is clearly seen
that when I = 0, the output is the sinusoid y(n) = A(n) sin(2π f_c n), corresponding
to zero modulation. Note that there is a slight discrepancy between Figure 2.1
(a) and Equation 2.1, since in the equation the phase, and not the frequency, is
being modulated. However, since these presentations are frequently encountered
in the literature, e.g., in (De Poli, 1983; Roads, 1995), they are also adopted here.
Holm (1992) and Bate (1990) discuss the effect of phase and differences between
implementations of the simple FM algorithm.
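Equation 2.1 maps directly to a few lines of code. The sketch below is our own illustration rather than code from the report; the function name, the sample rate fs, and the parameter values are assumptions, with frequencies given in Hz and time measured in samples:

```python
import math

def fm_tone(n_samples, fs, fc, fm, index, amp=1.0):
    """Simple FM synthesis (Eq. 2.1): the carrier phase is modulated
    by a sinusoid of frequency fm, scaled by the modulation index."""
    return [amp * math.sin(2 * math.pi * fc * n / fs
                           + index * math.sin(2 * math.pi * fm * n / fs))
            for n in range(n_samples)]

# fc/fm = 1, so the sidebands land on multiples of fm and the
# resulting spectrum is harmonic.
tone = fm_tone(n_samples=44100, fs=44100, fc=440.0, fm=440.0, index=2.0)
```

With index = 0 the output reduces to a pure sinusoid, matching the observation above.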
Figure 2.1: (a) A simple FM synthesis instrument. (b) One-oscillator feedback
system with output y_FD1(n) and two-oscillator feedback system with output
y_FD2(n), after (Roads, 1995).
The expression for the output signal in Equation 2.1 can be developed further
(Chowning, 1973; De Poli, 1983):

    y(n) = Σ_{k=-∞}^{∞} J_k(I) sin[2π (f_c + k f_m) n]             (2.2)

where J_k is the Bessel function of order k. Inspection of Equation 2.2 reveals that
the frequency-domain representation of the signal y(n) consists of a peak at f_c and
additional peaks at the frequencies

    f_n = f_c ± n f_m,   n = 1, 2, ...

as pictured in Figure 2.2. Part of the energy of the carrier waveform is distributed
to the side frequencies f_n. Note that Equation 2.2 allows the partials to be determined
analytically.
Figure 2.2: Frequency-domain presentation of FM synthesis. The spectrum shows
the carrier peak at f_c and side frequencies at f_c ± f_m and f_c ± 2f_m.
A harmonic spectrum is created when the ratio of the carrier and modulator frequencies is a ratio of integers, i.e.,

    f_c / f_m = N_1 / N_2,   N_1, N_2 ∈ Z.

Otherwise, the spectrum of the output signal is inharmonic. Truax (1977) discusses
the mapping of frequency ratios into spectral families.
2.1.2 Feedback FM
In simple FM, the amplitude ratios of harmonics vary unevenly when the modulation
index I is varied. Feedback FM can be used to solve this problem (Tomisawa, 1981).
Two feedback FM systems are pictured in Figure 2.1 (b). The one-oscillator feedback
FM system is obtained from the simple FM by replacing the frequency modulation
oscillator by a feedback connection from the output of the system. The two-oscillator
system uses a feedback connection to drive the frequency modulation oscillator.

Figure 2.3: A comparison of three different FM techniques. Spectra of one-oscillator
feedback FM are presented on top, those of two-oscillator feedback FM in the middle,
and spectra of simple FM on the bottom. The frequency values, f_FD1 and f_FD2, of
the oscillators in the feedback system are equal. The modulation index M is set to
2. Parameter b is the feedback coefficient. The parameter sets are (a) I = 12,
b = 1.5 and (b) I = 10, b = 0.8.

Figure 2.3 shows the effect of the feedback connections. The spectra of the signals
produced by the two feedback systems, as well as the spectra of the signal produced
by simple FM, are computed for two sets of parameters in Figures 2.3 (a) and
(b). The more regular behavior of the harmonics in the feedback systems is clearly
visible. Furthermore, it can be observed that the two-oscillator system produces
more harmonics for the same parameters.
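A one-oscillator feedback loop of the kind shown in Figure 2.1 (b) can be sketched as follows. This is our own minimal formulation (after Tomisawa, 1981), in which the previous output sample, weighted by the feedback coefficient beta, modulates the phase; exact scaling conventions vary between implementations:

```python
import math

def feedback_fm(n_samples, fs, f, beta, amp=1.0):
    """One-oscillator feedback FM: the previous output sample,
    scaled by the feedback coefficient beta, modulates the phase."""
    out, y = [], 0.0
    for n in range(n_samples):
        y = amp * math.sin(2 * math.pi * f * n / fs + beta * y)
        out.append(y)
    return out

signal = feedback_fm(n_samples=1024, fs=44100, f=440.0, beta=0.8)
```

With beta = 0 the feedback path is disconnected and the output is again a pure sinusoid.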
2.1.3 Other Developments of the Simple FM
Roads (1995) gives an overview of different methods based on simple FM. The first
commercial FM synthesizer, the GS1 digital synthesizer, was introduced by Yamaha
after further developing the FM synthesis method patented by Chowning. The first
synthesizer was very expensive, and it was only after the introduction of the famous
DX7 synthesizer that FM became the dominant sound synthesis method for years.
It is still used in many synthesizers, and in SoundBlaster-compatible computer
sound cards, chips, and software. Yamaha has patented the feedback FM method
(Tomisawa, 1981).
2.2 Waveshaping Synthesis
Waveshaping synthesis, also called nonlinear distortion, is a simple sound synthesis
method using a nonlinear shaping function to modify the input signal. The first
experiments on waveshaping were made by Risset in 1969 (Roads, 1995). Arfib (1979)
and Le Brun (1979) independently developed the mathematical formulation of
waveshaping. Both also performed some empirical experiments with the method.
In its most fundamental form, waveshaping is implemented as a mapping of a
sinusoidal input signal with a nonlinear distortion function w. Examples of these
mappings are illustrated in Figure 2.4. The function w maps the input value x(n)
in the range [-1, 1] to an output value y(n) in the same range. Waveshaping can be
implemented very easily by a simple table lookup, i.e., the function w is stored in a
table which is then indexed with x(n) to produce the output signal y(n).
Both Arfib (1979) and Le Brun (1979) observed that the ratios of the harmonics
can be accurately controlled by using Chebyshev polynomials as distortion
functions. The Chebyshev polynomials have the interesting property that when a
polynomial of order n is applied as a distortion function to a sinusoidal signal of
frequency ω, the output signal is a pure sinusoid of frequency nω. Thus,
by using a linear combination of Chebyshev polynomials as the distortion function,
the ratios of the amplitudes of the harmonics can be controlled. Furthermore, the
signal can be kept bandlimited, and aliasing of harmonics can be avoided.
See (Le Brun, 1979) for a discussion of the normalization of the amplitudes of the
harmonics.
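The Chebyshev property above can be made concrete with a short sketch. The helper below is our own illustration (the function name and the example weights are assumptions): it evaluates a weighted sum of Chebyshev polynomials with the standard recurrence, so that driving it with a unit-amplitude cosine produces exactly the harmonics named by the weights:

```python
import math

def cheby_shaper(weights):
    """Build a shaping function w(x) = sum_k a_k T_k(x) from Chebyshev
    weights. Since T_k(cos t) = cos(k t), feeding w a full-scale cosine
    yields harmonic k with amplitude a_k."""
    def w(x):
        # Chebyshev recurrence: T0 = 1, T1 = x, T_{k+1} = 2x T_k - T_{k-1}
        t_prev, t = 1.0, x
        y = weights[0] * t_prev
        for a in weights[1:]:
            y += a * t
            t_prev, t = t, 2.0 * x * t - t_prev
        return y
    return w

# Distortion function producing harmonics 1 and 3 with amplitudes 1 and 0.5:
w = cheby_shaper([0.0, 1.0, 0.0, 0.5])
sample = w(math.cos(0.3))   # one sample of a waveshaped unit cosine
```

Because the highest polynomial order bounds the highest harmonic, the output stays bandlimited, as noted above.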
The signal obtained by the waveshaping method can be postprocessed, e.g.,
by amplitude modulation. In this way the spectrum of the waveshaped signal has
components distributed around the modulating frequency f_m, spaced at intervals of f_0,
the frequency of the undistorted sinusoidal signal. If the spectrum is aliased, an
inharmonic signal may be produced. See (Arfib, 1979) for more details and (Roads,
1995) for references on other developments of waveshaping synthesis.
2.3 Karplus-Strong Algorithm
Karplus and Strong (1983) developed a very simple method for surprisingly high-quality
synthesis of plucked string and drum sounds. The Karplus-Strong (KS)
algorithm is an extension of the simple wavetable synthesis technique, in which the
sound signal is periodically read from computer memory. The modification is
to change the wavetable each time a sample is read. A block diagram of
simple wavetable synthesis and a generic design of the Karplus-Strong algorithm are
shown in Figure 2.5 (a) and (b), respectively. In the KS algorithm the wavetable is
initialized with a sequence of random numbers, as opposed to wavetable synthesis,
where usually a period of a recorded instrument tone is used.

The simplest modification that produces useful results is to average two consecutive
samples of the wavetable, as shown in Figure 2.5 (c). This can be written as

    y(n) = (1/2) [y(n - P) + y(n - P - 1)]                         (2.3)
Figure 2.4: Waveshaping with four different shaping functions. The input function
is presented on top.
Figure 2.5: The Karplus-Strong algorithm. Simple wavetable synthesis (a delay of
P samples) is shown in (a), a generic design with an arbitrary modification function
in (b), a Karplus-Strong model for plucked string tones in (c), and a Karplus-Strong
model for percussion instrument tones in (d), after (Karplus and Strong, 1983).
where P is the delay line length. The transfer function of the simple modifier filter
is

    H(z) = (1/2) (1 + z^-1)                                        (2.4)

This is a lowpass filter, and it accounts for the decay of the tone. A multiply-free
structure can be implemented with only a sum and a shift for every output sample.
This structure can be used to simulate plucked string instrument tones.
The model for percussion timbres is shown in Figure 2.5 (d). Now the output
sample y(n) depends on the wavetable entries by

    y(n) = +(1/2) [y(n - P) + y(n - P - 1)],   if r < b
    y(n) = -(1/2) [y(n - P) + y(n - P - 1)],   otherwise           (2.5)

where r is a uniformly distributed random variable between 0 and 1, and b is a
parameter called the blend factor. When b = 1, the algorithm reduces to that of
Equation 2.3. When b = 1/2, drum-like timbres are obtained. With b = 0, the entire
signal is negated every P + 1/2 samples, and a tone an octave lower with only odd
harmonics is produced.
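Equations 2.3 and 2.5 combine into a very short program. The sketch below is one common formulation, not the authors' original code (the names and parameter values are ours): a delay line of P random samples recirculates through the two-point averager, and with probability 1 - b the average is negated:

```python
import random
from collections import deque

def karplus_strong(period, n_samples, blend=1.0, seed=1):
    """Karplus-Strong synthesis: a wavetable of 'period' random samples
    is read cyclically; each output is the average of the two samples
    leaving the delay line (Eq. 2.3), negated with probability
    1 - blend (Eq. 2.5), and written back into the wavetable."""
    rng = random.Random(seed)
    line = deque(rng.uniform(-1.0, 1.0) for _ in range(period))
    prev = 0.0                       # plays the role of y(n - P - 1)
    out = []
    for _ in range(n_samples):
        oldest = line.popleft()      # y(n - P)
        y = 0.5 * (oldest + prev)
        if rng.random() >= blend:    # never taken when blend == 1
            y = -y
        line.append(y)               # recirculate into the wavetable
        prev = oldest
        out.append(y)
    return out

# Half a second at 44.1 kHz with a 100-sample table: a ~441 Hz pluck.
pluck = karplus_strong(period=100, n_samples=22050)
```

With blend = 1 the loop is the plain averaging filter of Equation 2.3 and the tone decays smoothly; blend = 0.5 gives the drum-like timbres described above.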
The KS algorithm is basically a comb filter. This can be seen by examining the
frequency response of the algorithm. In order to compute the frequency response, we
assume that we can feed a single impulse into the delay line that has been initialized
with zero values. We then compute the output signal and obtain a frequency-domain
representation from the response, i.e., we interpret the output signal as the impulse
response of the system. The corresponding frequency response is depicted in Figure
2.6. Notice the harmonic structure and that the magnitude of the peaks decreases
with frequency, as expected.

Karplus and Strong (1983) propose modifications to the algorithm, including
excitation with a nonrandom signal. A physical modeling interpretation of the
Karplus-Strong algorithm is taken by Smith (1983) and Jaffe and Smith (1983).
Extensions to the Karplus-Strong algorithm are presented in Section 5.3.
2.4 Sampling Synthesis
Sampling synthesis is a method in which recordings of relatively short sounds are
played back (Roads, 1995). Digital sampling instruments, also called samplers, are
typically used to perform pitch shifting, looping, or other modication of the original
sound signal (Borin et al., 1997b).
Manipulation of recorded sounds for compositional purposes dates back to the
1920s (Roads, 1995). Later, magnetic tape recording permitted cutting and splicing
of recorded sound sequences; thus, editing and rearrangement of sound segments
became possible. In 1950 Pierre Schaeffer founded the Studio de Musique Concrète
in Paris and began to use tape recorders to record and manipulate sounds (Roads,
1995). Analog samplers were based on either optical discs or magnetic tape devices.
Figure 2.6: Frequency response of the Karplus-Strong model (magnitude in dB
versus normalized frequency).
Sampling synthesis typically uses signals of several seconds. The synthesis itself
is very efficient to implement: in its simplest form, it consists of only one table
lookup and pointer update for every output sample. However, the required amount
of memory storage is huge. The three most widely used methods to reduce the memory
requirements are presented in the following: looping, pitch shifting, and
data reduction (Roads, 1995). The interested reader should also consult the work by
Bristow-Johnson (1996) on wavetable synthesis.
2.4.1 Looping
One obvious way of reducing the memory usage in sampling synthesis is to apply
looping to the steady state part of a tone (Roads, 1995). With a number of instrument families the tone stays relatively constant in amplitude and pitch after the
attack, until the tone is released by the player. The steady-state part can thus be
reproduced by looping over a short segment between so called loop points. After
the tone is released the looping ends and the sampler will play the decay part of the
tone.
The samples provided with commercial samplers are typically pre-looped, i.e., the
loop points are already determined for the user. For new samples the loop points
have to be determined by the user. One method is to estimate the
pitch of the tone and then select a segment whose length is a multiple of the period
of the fundamental frequency. This kind of looping technique tends to create tones
with a smooth looping part and constant pitch (Roads, 1995). If the looping part
is too short, an artificial-sounding tone can be produced because the time-varying
qualities of the tone are discarded. The looping points can also be spliced or
cross-faded together. A splice is simply a cut from one sound to the next, and it is bound
to produce a perceivable click unless done very carefully. In cross-fading, the end of
a looping part is faded out as the beginning of the next part is faded in. Even more
sophisticated techniques for the determination of good looping points are available;
see (Roads, 1995) for more information.
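As a concrete illustration, the basic sustain-looping scheme described above can be sketched as follows. This is a minimal sketch of the general idea, not the implementation of any particular sampler; the function name and parameters are our own.

```python
import numpy as np

def play_looped(sample, loop_start, loop_end, n_out):
    """Play a sampled tone, looping the segment [loop_start, loop_end)
    to sustain it for n_out output samples."""
    out = np.empty(n_out)
    pos = 0
    for i in range(n_out):
        out[i] = sample[pos]
        pos += 1
        if pos == loop_end:      # reached the end of the loop segment:
            pos = loop_start     # wrap back to the loop start
    return out
```

If the loop length (loop_end minus loop_start) is a multiple of the period of the fundamental, the looped sustain keeps a constant pitch, as described above.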
2.4.2 Pitch Shifting
In less expensive samplers there might not be enough memory capacity to store
every tone of an instrument, or not all the notes have been recorded. Typically only
every third or fourth semitone is stored, and the intermediate tones are produced
by applying pitch shifting to the closest sampled tone (Roads, 1995). This reduces
the memory requirements by a factor of three or four, so the data reduction is
significant.
Pitch shifting in inexpensive samplers is typically carried out using simple
time-domain methods that affect the length of the signal. The two methods usually
employed are varying the clock frequency of the output digital-to-analog converter and
sample-rate conversion in the digital domain (Roads, 1995). More elaborate methods
for pitch shifting exist; see (Roads, 1995) for references. One way of achieving
sampling-rate conversion is to use interpolated table reads with adjustable increments.
This can be done using the fractional delay filters described in (Välimäki, 1995;
Laakso et al., 1996).
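An interpolated table read with an adjustable increment can be sketched as below. For simplicity, plain linear interpolation stands in for the higher-quality fractional delay filters of Välimäki (1995); the function name and argument layout are ours.

```python
import numpy as np

def pitch_shift(table, ratio, n_out):
    """Resample a stored waveform by reading it with a fractional
    increment `ratio` (e.g. 2 ** (1 / 12) shifts up by one semitone).
    Linear interpolation between adjacent samples is a crude stand-in
    for a proper fractional delay filter."""
    out = np.empty(n_out)
    phase = 0.0
    for i in range(n_out):
        idx = int(phase)          # integer part: table index
        frac = phase - idx        # fractional part: interpolation weight
        out[i] = (1.0 - frac) * table[idx] + frac * table[idx + 1]
        phase += ratio
    return out
```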
2.4.3 Data Reduction
In many samplers the memory requirements are tackled by data reduction techniques.
These can be divided into plain data reduction, where part of the information
is merely discarded, and data compression, where the information is packed into
a more economical form without any loss in signal quality.
Data reduction usually degrades the perceived audio quality. It consists of simple
but crude techniques that either lower the dynamic range of the signal by using fewer
bits to represent each sample or reduce the sampling frequency of the signal. These
methods decrease the signal-to-noise ratio or the audio bandwidth, respectively.
More elaborate methods exist, and these usually take into account the properties of
the human auditory system (Roads, 1995).
Data compression eliminates the redundancies present in the original signal to
represent the information more memory-efficiently. Compression should not degrade
the quality of the reproduced signal.
2.5 Multiple Wavetable Synthesis Methods
Multiple wavetable synthesis is a set of methods that have in common the use of
multiple wavetables, i.e., sound signals stored in a computer memory. The most
widely used methods are wavetable cross-fading and wavetable stacking (Roads,
1995). Horner et al. (1993) introduce methods for obtaining optimal wavetables to
match signals of existing real instruments.
In wavetable cross-fading the tone is produced from several sections, each
consisting of a wavetable that is multiplied with an amplitude envelope. These portions
are summed together so that a sound event begins with one wavetable that is then
cross-faded to the next. This procedure is repeated over the course of the event
(Roads, 1995). A common way to use wavetable cross-fading is to start a tone with
a rich attack, such as a stroke or a pluck on a string, and then cross-fade this into
a sustained synthetic waveform (Roads, 1995).
Wavetable stacking is a variation of the additive synthesis discussed in Section
3.1. Several arbitrary sound signals are first multiplied with amplitude envelopes
and then summed together to produce the synthetic sound signal. Using wavetable
stacking, hybrid timbres can be produced that combine elements of several recorded
or synthesized sound signals. In commercial synthesizers, usually from four to eight
wavetables are used in wavetable stacking.
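The envelope-and-sum structure of wavetable stacking can be sketched in a few lines. The helper below and the two example wavetables are our own illustration, not taken from any commercial synthesizer.

```python
import numpy as np

def stack_wavetables(tables, envelopes):
    """Wavetable stacking: multiply each stored signal by its own
    amplitude envelope and sum the enveloped signals."""
    return sum(env * tab for tab, env in zip(tables, envelopes))

# A hybrid timbre from two hypothetical wavetables: a sine that fades
# out while a square wave of the same pitch fades in.
n = np.arange(1000)
tables = [np.sin(2 * np.pi * 220 * n / 8000),
          np.sign(np.sin(2 * np.pi * 220 * n / 8000))]
envelopes = [np.linspace(1.0, 0.0, 1000),
             np.linspace(0.0, 1.0, 1000)]
hybrid = stack_wavetables(tables, envelopes)
```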
In (Horner et al., 1993), methods for matching the time-varying spectra of a
harmonic wavetable-stacked tone to an original are presented. The objective is to
find wavetable spectra and the corresponding amplitude envelopes that produce a
close fit to the original signal. First, the original signal is analyzed using an extension
of the McAulay-Quatieri analysis method (McAulay and Quatieri, 1986) (see Section 3.4).
A genetic algorithm (GA) and principal component analysis (PCA) are
applied to obtain the basis spectra. The amplitude envelopes are created by finding
a solution that minimizes a least-squares error. The method produced good results
with four wavetables when the GA was applied (Horner et al., 1993).
2.6 Granular Synthesis
Granular synthesis is a set of techniques that share a common paradigm of
representing sound signals by sound atoms or grains. Granular synthesis originated
from the studies by Gabor in the late 1940s (Cavaliere and Piccialli, 1997; Roads,
1995). The synthetic sound signal is composed by adding these elementary units in
the time domain.
In granular synthesis one sound grain can have a duration ranging from one
millisecond to more than a hundred milliseconds, and the waveform of the grain can be
a windowed sinusoid, a sampled signal, or obtained from a physics-based model of
a sound production mechanism (Cavaliere and Piccialli, 1997). The granular
techniques can be classified according to how the grains are obtained. In the following,
a classification derived from one given by Cavaliere and Piccialli (1997) is presented
and the techniques of each category are briefly described.
2.6.1 Asynchronous Granular Synthesis
Asynchronous granular synthesis (AGS) has been developed by Roads (1991, 1995).
It is a method that scatters sound grains in a statistical manner over a region in
the time-frequency plane. The regions are called sound clouds, and they form the
elementary unit the composer works with (Roads, 1995). A cloud is specified by the
following parameters: start time and duration of the cloud, grain duration, density
of grains, amplitude envelope and bandwidth of the cloud, waveform of each grain,
and spatial distribution of the cloud.
The grains of a cloud can all have similar waveforms or a random mixture of
different waveforms. A cloud can also mutate from grains with one waveform to
grains with another over the duration of the cloud. The duration of a grain also
affects its bandwidth: the shorter a grain is, the more it is spread in the frequency
domain. Pitched sounds can be created with low-bandwidth clouds. Roads (1991,
1995) gives a more detailed discussion of the parameters of AGS.
AGS is effective in creating new sound events that are not easily produced by
musical instruments. On the other hand, simulations of existing sounds are very
hard to achieve with AGS. In the following, granular synthesis methods better
suited for that purpose are presented.
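A crude AGS cloud can be sketched as follows: grains are scattered uniformly at random over the cloud's time span and frequency band. The grain waveform (a Hann-windowed sinusoid), the parameter set, and the function names are our own simplification of the much richer parameterization Roads describes.

```python
import numpy as np

rng = np.random.default_rng(0)

def grain(freq, dur, sr):
    """A single grain: a Hann-windowed sinusoid."""
    n = np.arange(int(dur * sr))
    return np.hanning(len(n)) * np.sin(2 * np.pi * freq * n / sr)

def cloud(start, length, f_lo, f_hi, density, grain_dur, sr):
    """Scatter about density * length grains at random times within the
    cloud and at random frequencies within its band (a crude AGS cloud)."""
    out = np.zeros(int((start + length) * sr) + int(grain_dur * sr))
    for _ in range(int(density * length)):
        t0 = int((start + rng.uniform(0, length)) * sr)  # random onset
        g = grain(rng.uniform(f_lo, f_hi), grain_dur, sr)
        out[t0:t0 + len(g)] += g                          # mix grain in
    return out
```

Narrowing the band [f_lo, f_hi] yields the low-bandwidth, pitched clouds mentioned above.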
2.6.2 Pitch Synchronous Granular Synthesis
Pitch synchronous granular synthesis (PSGS) is a method developed by De Poli and
Piccialli (1991). The method is also discussed in (Cavaliere and Piccialli, 1997) and
briefly in (Roads, 1995). In PSGS, grains are derived from the short-time Fourier
transform (STFT). The signal is assumed to be (nearly) periodic and, first, the
fundamental frequency of the signal is detected. The period of the signal is used as
the length of the rectangular window used in the STFT analysis. When used
synchronously with the fundamental frequency, the rectangular window has the attractive
property of minimizing the side effects that occur with windowing, i.e., the spread
of spectral energy.
After windowing, a set of analysis grains is obtained in such a way that each
grain corresponds to one period of the signal. From these analysis grains, impulse
responses corresponding to prominent content in the frequency-domain representation
are derived. Methods for the system impulse response estimation include linear
predictive coding (LPC) and interpolation of the frequency-domain representation
of a single period of the signal (Cavaliere and Piccialli, 1997).
In the resynthesis stage, a train of impulses is used to drive a set of parallel FIR
filters obtained from the system impulse responses. The period of the pulse train is
obtained from the detected fundamental frequency. See (De Poli and Piccialli, 1991)
for transformations that can create variations of the produced signal.
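The resynthesis step just described, an impulse train driving an FIR filter, can be sketched as follows. For simplicity the sketch takes one analysis grain directly as the FIR impulse response; the function name and signature are ours.

```python
import numpy as np

def psgs_resynth(grain, f0, sr, n_out):
    """PSGS resynthesis sketch: an impulse train at the detected
    fundamental f0 excites an FIR filter whose impulse response is an
    analysis grain (here taken directly as one period of the signal)."""
    period = int(round(sr / f0))
    pulses = np.zeros(n_out)
    pulses[::period] = 1.0                     # impulse train at f0
    return np.convolve(pulses, grain)[:n_out]  # FIR filtering
```

Changing `period` independently of the grain transposes the tone while keeping the spectral envelope of the grain fixed.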
2.6.3 Other Granular Synthesis Methods
Some sound synthesis methods presented elsewhere in this document can also be
interpreted as special cases of granular synthesis. These include the wave-function
synthesis (Section 3.8.1) and VOSIM (Section 3.8.2). In fact, all methods that use
the overlap-add technique for synthesizing sound signals can be thought of as being
instances of granular synthesis.
Another possibility is to apply the wavelet transform to obtain a multiresolution
representation of the signal. See (Evangelista, 1997) for a discussion on wavelet
representations of musical signals.
3. Spectral Models
Spectral sound synthesis methods are based on modeling the properties of sound
waves as they are perceived by the listener. Many of them also take knowledge
of psychoacoustics into account. Spectral sound synthesis methods are general in
that they can be applied to model a wide variety of sounds.
In this chapter, three traditional linear synthesis methods, namely additive synthesis,
the phase vocoder, and source-filter synthesis, are first discussed. Second, the
McAulay-Quatieri algorithm, Spectral Modeling Synthesis (SMS), Transient Modeling
Synthesis (TMS), and the inverse-FFT-based additive synthesis method (FFT^-1
synthesis) are described. Finally, two methods for modeling the human voice,
CHANT and VOSIM, are briefly addressed.
3.1 Additive Synthesis
Additive synthesis is a method in which a composite waveform is formed by summing
sinusoidal components, for example the harmonics of a tone, to produce a sound
(Moorer, 1985). It can be interpreted as a method that models the time-varying spectrum
of a tone by a set of discrete lines in the frequency domain (Smith, 1991).
The concept of additive synthesis is very old and it has been used extensively
in electronic music; see (Roads, 1995, pp. 134-136) for references and a historical
treatment. In 1964 Risset (1985) applied the method for the first time to reproduce
sounds based on the analysis of recorded tones. With this application to trumpet
tones, the control data was reduced by applying piecewise-linear approximation
of the amplitude envelopes. Many of the modern spectral modeling methods use
additive synthesis in some form; these methods are discussed later in this chapter. A
block diagram of additive synthesis with slowly varying control functions is depicted
in Figure 3.1.
In additive synthesis, three control functions are needed for every sinusoidal
oscillator: the amplitude, frequency, and phase of each component. In many cases
the phase is left out and only the amplitude and frequency functions are used.

Figure 3.1: Time-varying additive synthesis, after (Roads, 1995). Each of the M sinusoidal oscillators k = 0, 1, 2, ..., M-1 is driven by a frequency control function F_k(n) and an amplitude control function A_k(n).

The output signal y(n) is the sum of the components and can be represented as

    y(n) = Σ_{k=0}^{M-1} A_k(n) sin[2π F_k(n)]                    (3.1)

where T is the sampling interval, n is the time index, M is the number of
sinusoidal oscillators, ω_k is the radian frequency of the kth oscillator, A_k(n) is the
time-varying amplitude of the kth partial, and F_k(n) is the frequency deviation of the kth
partial. If the tone is harmonic, ω_k is a multiple of the fundamental radian frequency
ω_0, i.e., ω_k = kω_0. A_k(n) and F_k(n) are assumed to be slowly time-varying.
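One common discrete-time realization of the oscillator bank accumulates each partial's instantaneous frequency into a running phase. The sketch below is our own formulation of that idea: for simplicity, F here holds the full instantaneous frequency of each partial in Hz rather than a deviation from a fixed ω_k.

```python
import numpy as np

def additive(A, F, sr):
    """Oscillator-bank additive synthesis. A and F are (M, N) arrays
    holding the amplitude and instantaneous-frequency (Hz) control
    functions of M partials over N samples; each oscillator's phase is
    the running sum of its instantaneous frequency."""
    phase = 2 * np.pi * np.cumsum(F, axis=1) / sr  # integrate frequency
    return np.sum(A * np.sin(phase), axis=0)       # sum the partials
```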
The control functions can be obtained with several procedures (Roads, 1995).
One is to use arbitrary shapes; for instance, some composers have tracked the shapes
of mountains or urban skylines. The functions can also be generated using
composition programs. An analysis method can also be applied to map a natural sound into
a series of control functions. Such a system is pictured in Figure 3.2. The short-time
Fourier transform (STFT) is calculated from the input signal. Harmonics
are mapped to peaks in the frequency domain, and their frequency and amplitude
functions can be detected from the series of STFT frames. These control functions
can be used directly to synthesize tones in the system of Figure 3.1.

Figure 3.2: The additive analysis technique, after (Roads, 1995). An STFT is calculated from the input signal x(n), and frequency trajectories F_k(n) and amplitude trajectories A_k(n) in the time domain are formed.
The main drawbacks of additive synthesis are the enormous amount of control data
involved and the demand for a large number of oscillators. The method gives its best
results when applied to harmonic or almost harmonic signals in which little noise is
present. Synthesis of noisy signals requires a vast number of oscillators. A method
for the reduction of control data is discussed in the following subsection.
3.1.1 Reduction of Control Data in Additive Synthesis by
Line-Segment Approximation
There are several ways to reduce the amount of control data needed (Roads, 1995,
p. 149; Moorer, 1985). The main criteria for a data reduction method are 1) to
retain the intuitively appealing form of the control data, i.e., the composer has to
be able to easily modify the control data to obtain musically interesting effects on
the sound, and 2) to preserve the original sound in the absence of transformations.
Line-segment approximation can be utilized to obtain a set of piecewise-linear
curves approximating the frequency and amplitude control functions. The
method has been used by Risset (1985), and it is also discussed by Moorer (1985)
and Strawn (1980). The idea of line-segment approximation is to fit a set of
straight lines to each control function in such a way that the curve obtained
resembles the original curve. An example is illustrated in Figure 3.3, where the
amplitude trajectory of the 4th partial of a flute tone has been approximated using
line segments.

Figure 3.3: The line-segment approximation in additive synthesis.
When 32 partials of a tone are approximated by line-segment approximation
using ten segments with 16-bit numbers for each partial, the result is approximately
5120 bits of data for a half-second tone. The same tone, when using a sampling rate
of 44 100 Hz and 16-bit samples, amounts to 352 800 bits. Thus the data reduction
ratio is about 1 to 69.
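A simple form of line-segment approximation can be sketched as below. Real systems place breakpoints adaptively; here, purely for illustration, the breakpoints are equally spaced, and the helper names are ours.

```python
import numpy as np

def line_segment_approx(env, n_seg):
    """Approximate a control function by n_seg straight lines with
    equally spaced breakpoints (a crude stand-in for optimized
    breakpoint placement)."""
    idx = np.linspace(0, len(env) - 1, n_seg + 1).astype(int)
    return idx, env[idx]            # breakpoint positions and values

def reconstruct(idx, vals, n):
    """Rebuild a full-length envelope by linear interpolation
    between the stored breakpoints."""
    return np.interp(np.arange(n), idx, vals)
```

Only the breakpoints are stored, which is the source of the data reduction estimated above; the composer can then edit the breakpoints directly.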
3.2 The Phase Vocoder
The phase vocoder was developed at Bell Laboratories and was first described by
Flanagan and Golden (1966). All vocoders represent the input signal in multiple
parallel channels, each of which describes the signal in a particular frequency band.
Vocoders simplify the complex spectral information and reduce the amount of data
needed to represent the signal.
In the original channel vocoder (Dudley, 1939) the signal is described as an
excitation signal and values of the short-time amplitude spectrum measured at discrete
frequencies. The phase vocoder, however, uses complex short-time spectra and thus
preserves the phase information of the signal.
The analysis part of the method can be considered to be either a bank of bandpass
filters or a short-time spectrum analyzer. These viewpoints are mathematically
equivalent since, in theory, the original signal can be reproduced undistorted (Gordon
and Strawn, 1985; Dolson, 1986). Portnoff (1976) gives a mathematical treatment
of the subject, and he also shows that the phase vocoder can be formulated as an
identity system in the absence of parameter modifications. An introductory text on
the phase vocoder can be found in (Serra, 1997a).
An implementation of the phase vocoder using the STFT is computationally
more efficient than one using a filter bank, since the complex spectrum can be evaluated
with the fast Fourier transform (FFT) algorithm. Detailed discussions of the phase
vocoder and practical implementations are given by Portnoff (1976), Moorer (1978),
and Dolson (1986). Code implementing the phase vocoder can be found in
(Gordon and Strawn, 1985) and (Moore, 1990).
The phase vocoder is pictured in Figure 3.4. The input signal is divided into
equally spaced frequency bands. This can be done by applying the STFT to the
windowed signal. Each bin of the STFT frame corresponds to the magnitude and
phase values of the signal in that frequency band at the time of the frame.
Figure 3.4: The phase vocoder, after (Roads, 1995). First the STFT is calculated
from the input signal x(n). The signal is then represented as multiple series of complex
number pairs corresponding to the signal components in each frequency band. The
output signal y(n) is composed by calculating the inverse FFT for each frame and by
using the overlap-add method to reconstruct the signal in the time domain.
Time scaling and pitch transposition are effects that can easily be performed
using the phase vocoder (Dolson, 1986; Serra, 1997a). Time-varying filtering can
also be applied (Serra, 1997a). Time scaling is done by modifying the hop size in the
synthesis stage. If the hop size is increased, each STFT frame effectively sounds
longer, and the produced signal is a stretched version of the original; if the hop size
is reduced, the opposite occurs. The modification of the hop size has to be taken
into account in the analysis stage by choosing a window that minimizes the side
effects; otherwise, some output samples are given more weight and the synthetic
signal is amplitude modulated. The phase values also need to be compensated for
in the modification stage: they have to be multiplied by a scaling
factor in order to retain the correct frequencies. Pitch shifting without changing
the temporal evolution can be accomplished by first modifying the time scale by
the desired pitch-scaling factor and then changing the sampling rate of the signal
correspondingly. See (Serra, 1997a) for examples and more details on time-scale
modifications and problems that arise when the frequency resolution of the STFT
analysis is not sufficient. The problem of phasiness in time-scale modifications
is discussed by Laroche and Dolson (1997), who also propose a phase synchronization
scheme.
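The hop-size modification with phase compensation described above can be sketched compactly as follows. This is a bare-bones illustration under assumed defaults (Hann analysis and synthesis windows, 1024-point FFT, analysis hop of 256 samples), not the implementation of Portnoff, Moorer, or Dolson, and it ignores the resolution and phasiness problems just mentioned.

```python
import numpy as np

def pv_stretch(x, stretch, n_fft=1024, hop_a=256):
    """Phase-vocoder time stretching sketch: STFT frames taken every
    hop_a samples are resynthesized every hop_s = stretch * hop_a
    samples, with phases accumulated so each bin keeps its measured
    frequency."""
    hop_s = int(round(stretch * hop_a))
    win = np.hanning(n_fft)
    frames = [np.fft.rfft(win * x[i:i + n_fft])
              for i in range(0, len(x) - n_fft, hop_a)]
    omega = 2 * np.pi * np.arange(n_fft // 2 + 1) / n_fft  # bin freqs
    out = np.zeros(hop_s * len(frames) + n_fft)
    phase = np.angle(frames[0])
    prev = frames[0]
    for t, frame in enumerate(frames):
        if t > 0:
            # deviation of the measured phase advance from the bin's
            # nominal advance, wrapped to [-pi, pi)
            dphi = np.angle(frame) - np.angle(prev) - omega * hop_a
            dphi = (dphi + np.pi) % (2 * np.pi) - np.pi
            # accumulate the instantaneous frequency at the new hop
            phase += (omega + dphi / hop_a) * hop_s
            prev = frame
        y = np.fft.irfft(np.abs(frame) * np.exp(1j * phase))
        out[t * hop_s:t * hop_s + n_fft] += win * y  # overlap-add
    return out
```

With stretch = 2.0 the output lasts roughly twice as long while the dominant frequencies are preserved.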
The phase vocoder works best when used with harmonic and static or slowly
changing tones. It has difficulties with noisy and rapidly changing sound signals.
Such signals can be modeled better with a tracking phase vocoder or with the spectral
modeling synthesis described in Section 3.5.
3.3 Source-Filter Synthesis
In source-filter synthesis the sound waveform is obtained by filtering an excitation
signal with a time-varying filter. The method is sometimes called subtractive
synthesis. The technique has been used especially to produce synthetic speech; see, e.g.,
(Moorer, 1985) for more details and references. It has also been used for musical
applications (Roads, 1995).
A block diagram of the method is depicted in Figure 3.5. The idea is to have a
broadband or harmonically rich excitation signal which is filtered to obtain the desired
output, as opposed to additive synthesis, where the waveform is composed as a
sum of simple sinusoidal components. In theory, an arbitrary periodic bandlimited
waveform can be generated from a train of impulses by filtering. Complex waveforms
are simple to generate by using a complex excitation signal. A method to
generate bandlimited pulse trains is introduced by Stilson and Smith (1996).
Figure 3.5: Source-filter synthesis. A white noise generator and an impulse train
generator feed the time-varying filter H(z), whose transfer function is described by
the filter coefficients a(n) and b(n).
The human voice production mechanism can be approximated as an excitation
sound source feeding a resonating system. When source-filter synthesis is used to
synthesize speech, the sound source generates either a periodic pulse train or white
noise, depending on whether the speech is voiced or unvoiced, respectively. The filter

    H(z) = ( Σ_{k=0}^{K} b_k z^{-k} ) / ( 1 + Σ_{l=1}^{L} a_l z^{-l} )          (3.2)
models the resonances of the vocal tract. The coefficients a(n) and b(n) of the filter
vary with time, thus simulating the movements of the lips, the tongue, and other parts
of the vocal tract. The periodic pulse train simulates the glottal waveform. Many
traditional musical instruments have a stationary or slowly time-varying resonating
system, and source-filter synthesis can be used to model such instruments. The
method has also been used in analog synthesizers. When applied to speech or
singing, the method can be interpreted as physical modeling of the human sound
production mechanism.
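The voiced branch of Figure 3.5 can be sketched directly from Eq. (3.2): an impulse train at the fundamental period is passed through the recursive filter. The single resonance used below (pole radius 0.97, hypothetical frequency) is our own illustrative choice, not a measured vocal-tract model.

```python
import numpy as np

def source_filter(excitation, b, a):
    """Direct-form realization of Eq. (3.2):
    y(n) = sum_k b[k] x(n-k) - sum_l a[l-1] y(n-l), with a indexed
    from l = 1 as in the denominator 1 + sum_l a_l z^-l."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = sum(b[k] * excitation[n - k]
                  for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[l - 1] * y[n - l]
                   for l in range(1, len(a) + 1) if n - l >= 0)
        y[n] = acc
    return y

# voiced excitation: an impulse train at the fundamental period
period = 40
excitation = np.zeros(400)
excitation[::period] = 1.0
# one resonance at 0.05 times the sampling rate (hypothetical pole pair)
r, theta = 0.97, 2 * np.pi * 0.05
a = [-2 * r * np.cos(theta), r * r]   # denominator 1 + a1 z^-1 + a2 z^-2
y = source_filter(excitation, b=[1.0], a=a)
```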
The excitation signal and the filter coefficients fully describe the output waveform.
If only a wideband pulse train and noise are used, it is enough to decide
between unvoiced and voiced sounds. If the pulse form is fixed, only the period, i.e.,
the fundamental frequency of the pulse train, remains to be determined.
Detection of the pitch is not a trivial problem, and it has been studied extensively,
mainly by speech researchers. Pitch detection methods can be divided into five
categories: time-domain methods, autocorrelation-based methods, adaptive filtering,
frequency-domain methods, and models of the human ear. These are discussed,
e.g., in (Roads, 1995) with references.
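As an example of the autocorrelation-based category, a minimal pitch detector can be sketched as follows: the fundamental period is taken as the lag, within an allowed range, at which the autocorrelation of the frame peaks. The function name and the default search range are our own choices.

```python
import numpy as np

def detect_pitch(x, sr, f_lo=50.0, f_hi=1000.0):
    """Autocorrelation-based pitch detection sketch: return sr / lag,
    where lag is the location of the autocorrelation maximum within
    the lag range allowed by [f_lo, f_hi]."""
    ac = np.correlate(x, x, mode='full')[len(x) - 1:]  # lags 0..N-1
    lo, hi = int(sr / f_hi), int(sr / f_lo)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag
```

Real detectors add interpolation around the peak and voiced/unvoiced decisions; this sketch returns only the raw integer-lag estimate.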
The filter coefficients can be efficiently computed by applying linear predictive
(LP) analysis. The basic idea of LP is that it is possible to design an all-pole filter
whose magnitude frequency response closely matches that of an arbitrary sound.
The difference between the STFT and LP is that LP measures the envelope of the
magnitude spectrum, whereas the STFT measures the magnitude and phase at a
large number of equally spaced points. LP is a parametric method whereas the STFT
is non-parametric, and LP is optimal in that it gives the best match of the spectrum
in the minimum-squared-error sense. The method is discussed in detail by Makhoul
(1975).
The fundamental frequency of the output waveform depends only on the fundamental
frequency of the pulse train, so the timing and the fundamental frequency can
be varied independently. Also, the synthesis system can be excited with a complex
waveform, thus creating new sounds that have characteristics of the excitation sound
as well as of the resonating system.
Although, in theory, arbitrary signals can be produced, source-filter synthesis
is not a very robust representation for generic wideband audio or musical signals.
Ways to improve the sound quality are discussed by Moorer (1979).
3.4 McAulay-Quatieri Algorithm
An analysis-based representation of sound signals suitable for additive synthesis has
been presented by McAulay and Quatieri (1986). The McAulay-Quatieri (MQ)
algorithm originated from research on speech signals, but it was already reported in
the first study that the algorithm is capable of synthesizing a broader class of sound
signals (McAulay and Quatieri, 1986). Other parameter estimation algorithms
for additive synthesis exist (Risset, 1985; Smith and Serra, 1987), and the MQ and
related algorithms have been utilized in many spectral modeling systems (Serra and
Smith, 1990; Fitz and Haken, 1996).
In the MQ algorithm the original signal is decomposed into signal components
that are resynthesized as a set of sinusoids. The kth signal component at time
location l is represented as a triplet {A_k^l, ω_k^l, θ_k^l} constituting three types of
trajectories, namely amplitude, frequency, and phase trajectories, that are used in
the synthesis stage. The time locations l are determined by the hop size parameter
N_hop of the STFT as

    l = n N_hop,    n = 0, 1, 2, ...
The MQ algorithm can be programmed to adapt to the analyzed signal, e.g., the
number of detected signal components and the hop size parameter can vary in time.
The method is efficient in representing harmonic or voiced signals with little noise
or few transitions. If noisy or unvoiced signals are to be reproduced, a large number
of sinusoids is needed. In the example described in the study by McAulay and
Quatieri (1986), the maximum number of detected peaks was set to 80 with a
sampling frequency of 10 kHz and a hop size of 10 ms. It was shown that waveforms
of harmonic signals are reproduced accurately, whereas the reproduced waveforms
of noisy signals do not resemble the originals well.
The analysis part of the MQ algorithm uses the STFT to obtain a representation
for each signal component. The input signal x(n) is windowed in the time domain
to a length typically ranging between 4 and 30 ms. A discrete Fourier transform
(DFT) is computed from the windowed signal x_w(n). Peaks in the complex spectrum
X_w(k) corresponding to sinusoidal signal components are detected, and they are
used to obtain the amplitude, frequency, and phase trajectories that compose the
signal representation. The analysis steps are elaborated further in the following
subsections.
3.4.1 Time-Domain Windowing
Choosing a proper window function is a compromise between the width of the main
lobe (frequency resolution of each signal component) and the attenuation of the side
lobes (spreading in the frequency domain). A detailed discussion of window functions
is given by Harris (1978) and Nuttall (1981). In the original study, McAulay and
Quatieri utilize a Hamming window. In the frequency domain, it has a 43 dB
attenuation of the largest side lobe and an asymptotic decay of 6 dB/octave (Nuttall,
1981). The same window function is also used in the examples presented in this
section.
The length of the window function determines the time resolution of the analysis.
The length of the Hamming window function should be at least 2 1/2 times the
period of the fundamental frequency (McAulay and Quatieri, 1986). The window
length can be time-varying to adapt to the analyzed signal.
It is beneficial to increase the frequency resolution of the DFT by increasing
the length of the windowed signal, concatenating the windowed signal with zeros
(Smith and Serra, 1987). This is called zero padding, and typically the length of
the windowed signal is increased to a power of two to allow for the use of the fast
Fourier transform (FFT), an efficient algorithm for the computation of the DFT.
Note, however, that the frequency resolution in the analysis is further improved by
applying an interpolation scheme proposed by Smith and Serra (1987).

Figure 3.6: An example of zero-phase windowing. On the left a signal is windowed
about the time origin. An equivalent signal for the DFT is displayed in the middle.
In zero padding, the zeros are inserted in the middle of the signal, as shown on the
right.
For the detection of the phase values it is important to use zero-phase windowing
to avoid a linear trend in the phase spectra (Serra, 1989). An example of zero-phase
windowing is shown in Figure 3.6. On the left, a portion of a guitar signal is windowed
using a Hamming window with a length of 501 samples, and the windowed signal
is centered about the time origin at indices -250, -249, ..., 250. In practice, the
circular properties of the DFT are used and the left half (indices -250, ..., -1)
of the signal is positioned at time indices 252, ..., 501, as shown in the middle of
Figure 3.6. On the right, the signal is zero-padded to a length of 1024. Notice that
the zeros are inserted in the middle of the wrapped signal.
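The wrapping-and-padding operation just described can be sketched as follows. The function name and buffer conventions (zero-based indexing, the window center mapped to index 0) are our own; the test of success is that a symmetric input yields a nearly zero phase spectrum.

```python
import numpy as np

def zero_phase_window(x, center, win, n_fft):
    """Zero-phase windowing with zero padding: the windowed segment is
    wrapped so that its center lands at index 0 and the padding zeros
    end up in the middle of the DFT buffer."""
    half = len(win) // 2
    seg = win * x[center - half:center - half + len(win)]
    buf = np.zeros(n_fft)
    buf[:len(win) - half] = seg[half:]   # right half (incl. center) first
    buf[-half:] = seg[:half]             # left half wrapped to the end
    return buf
```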
3.4.2 Computation of the STFT
The STFT is composed of a series of DFTs computed on windowed signal portions
separated in time by the hop size parameter N_hop. A value between N_win/2
and N_win/16 is typically used for the hop size parameter, where N_win is the length
of the time-domain analysis window.
A DFT is performed on the zero-phase-windowed and zero-padded signal x_w(n).
The DFT returns a complex sequence X_w(k) of the same length as the input signal.
The sequence X_w(k) is a frequency-domain representation of the signal, and it is centered
around the frequency origin. Because the analyzed signal is real, the values
at positive and negative frequencies are complex conjugates, i.e., X_w(k) = X_w*(-k).
In the following, we will only consider the values of X_w(k) at positive frequencies.
The sequence X_w(k) is interpreted as the magnitude and phase spectra of the windowed
signal by changing to polar coordinates. An example of a single STFT frame is shown
in Figure 3.8, where the magnitude (top) and phase (bottom) spectra of a windowed
guitar signal are plotted.
3.4.3 Detection of the Peaks in the STFT
The peaks in the magnitude spectrum correspond to prominent signal components
that are modeled as sinusoidal signals. In general, determining whether
a peak is a prominent one is rarely trivial. In the case of a harmonic tone, the
harmonic structure of the magnitude spectrum can be exploited: the fundamental
frequency of the recorded signal is estimated, and it then suffices to search for the
local maxima of each magnitude spectrum in the vicinities of the multiples of the
fundamental frequency.
Peak detection is best performed on the dB scale (Serra, 1989). A local maximum
in the vicinity of a harmonic frequency can be detected by first determining
the range of the search. Typically, the peak corresponding to the kth partial is
searched for in the range [(k - 1/4) f̂_0, (k + 1/4) f̂_0]. The maximum value in this range
is detected, and if it is a local maximum, it is marked as a peak. This procedure is
carried out for every partial in every frame.
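For a single frame, the search just described can be sketched as below. The spectrum is assumed to be given in dB with the fundamental expressed as a bin index; the function name and the local-maximum check are our own formulation.

```python
import numpy as np

def harmonic_peaks(mag_db, f0_bin, n_partials):
    """Search a dB magnitude spectrum for a local maximum near each
    multiple of the (estimated) fundamental bin f0_bin; the search
    range for partial k is [(k - 1/4) f0_bin, (k + 1/4) f0_bin]."""
    peaks = []
    for k in range(1, n_partials + 1):
        lo = int((k - 0.25) * f0_bin)
        hi = min(int((k + 0.25) * f0_bin) + 1, len(mag_db))
        i = int(lo + np.argmax(mag_db[lo:hi]))
        # accept only genuine local maxima
        if 0 < i < len(mag_db) - 1 and mag_db[i - 1] < mag_db[i] > mag_db[i + 1]:
            peaks.append(i)
    return peaks
```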
3.4.4 Removal of Components below Noise Threshold Level
The peaks detected in the previous subsection will contain values that do not correspond
to sinusoidal components. It is therefore essential to remove the detected peaks
with values below a chosen noise threshold. Typically, this threshold should be
frequency dependent. The noise level can be estimated from the recorded signals, e.g.,
in the pauses of speech.
If a single tone is analyzed, the sinusoidal components can be detected starting
from the end of the signal (Smith and Serra, 1987). The amplitude values can then
be set to zero until a distinctive signal component is found.
3.4.5 Peak Continuation
After the peaks below the noise threshold level are removed, a peak continuation
algorithm is utilized to produce the amplitude and frequency trajectories corresponding to the
sinusoidal components of the original signal. It is assumed that the
sinusoids are fairly stationary between frames, and thus the algorithm assigns a peak
to an existing trajectory if their frequency values are close enough. A parameter for
the maximum frequency deviation $f_D$ between consecutive frames is used as the limiting criterion. If there is no existing trajectory for that component in the previous
frame, a new trajectory is started, i.e., it is born (McAulay and Quatieri, 1986).
This is done by creating a triplet in the previous frame with zero amplitude, the
same frequency, and a phase value that is computed by subtracting the phase shift of
one frame from the detected phase value. Similarly, if no peak matching an existing
trajectory is found, that trajectory is killed (McAulay and Quatieri, 1986). In this
case a triplet with zero amplitude, the same frequency, and a shifted phase value is
inserted in the next frame.
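The frame-to-frame matching step can be sketched as a greedy nearest-neighbor search. The helper below is a simplified illustration (names invented here, and the zero-amplitude triplet bookkeeping for births and deaths is omitted):

```python
def continue_tracks(prev_peaks, new_peaks, f_dev):
    """One frame of nearest-neighbor peak continuation.

    prev_peaks, new_peaks: lists of peak frequencies (Hz) detected in
    the previous and the current frame.  Returns (matches, born,
    killed), where matches maps previous-peak index -> current-peak
    index for peaks within the maximum deviation f_dev.
    """
    matches = {}
    used = set()
    for i, f_prev in enumerate(prev_peaks):
        # candidate current-frame peaks within the maximum deviation
        cands = [j for j, f in enumerate(new_peaks)
                 if j not in used and abs(f - f_prev) <= f_dev]
        if cands:
            j = min(cands, key=lambda j: abs(new_peaks[j] - f_prev))
            matches[i] = j
            used.add(j)
    killed = [i for i in range(len(prev_peaks)) if i not in matches]
    born = [j for j in range(len(new_peaks)) if j not in used]
    return matches, born, killed
```

For example, with previous peaks at 100, 200, and 305 Hz, current peaks at 101, 203, and 400 Hz, and a 5 Hz deviation limit, the first two tracks continue, the third is killed, and a new track is born at 400 Hz.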
Figure 3.7: An example of the peak continuation algorithm, after (McAulay and Quatieri, 1986). On the left a match is found, and the peak is assigned to the track. In the middle no peak within the maximum deviation $f_D$ is found, and the track is killed. On the right, a peak is detected that does not match any peak in the previous frame, and a track is born.
An example of the nearest-neighbor peak continuation algorithm is illustrated
in Figure 3.7. On the left, a frequency value is detected that is within the frequency
deviation threshold fD and the peak is assigned to the corresponding track. In the
middle, no peak with a frequency value within the limit is found, and the track
is killed. On the right, a new track is born, i.e., a peak is detected that does not
correspond to any of the peaks in the previous frame.
3.4.6 Peak Value Interpolation and Normalization
A better frequency resolution of the peak detection can be obtained by applying
a parabolic interpolation scheme proposed by Smith and Serra (1987) and detailed
by Serra (1989). In parabolic interpolation a parabola is tted to the three points
consisting of the maximum and the adjacent values. A point corresponding to the
maximum value of the parabola is detected. The point yields the amplitude and the
frequency values of the corresponding peak. The phase value is detected in the phase
spectrum by interpolating linearly between the adjacent frequency points enclosing
the location of the peak.
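A minimal sketch of the parabolic refinement, assuming the three (log-)magnitude samples around the detected maximum are available:

```python
def parabolic_peak(mag_db, i):
    """Refine a detected peak at bin i by parabolic interpolation.

    Fits a parabola through bins (i-1, i, i+1) of the magnitude
    spectrum and returns (fractional_bin, peak_value), following the
    scheme of Smith and Serra (1987).
    """
    a, b, c = mag_db[i - 1], mag_db[i], mag_db[i + 1]
    offset = 0.5 * (a - c) / (a - 2.0 * b + c)   # lies in (-1/2, 1/2)
    value = b - 0.25 * (a - c) * offset          # height of the vertex
    return i + offset, value
```

The fractional bin is then converted to Hz by multiplying with the bin spacing; the phase is read off separately by linear interpolation in the phase spectrum, as stated above.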
An example of peak detection is shown in Figure 3.8. The peaks above
−60 dB in magnitude are detected and denoted with a cross (×) in the magnitude
and phase spectra. A zoom into the phase spectrum in Figure 3.9 shows the efficiency
of the zero-phase windowing: the phase values are almost constant in the vicinity of
a harmonic component. This greatly reduces the estimation error of the detected
phase value.
The effect of the time-domain windowing has to be compensated for in the amplitude values. The normalization factor $c_w$ of the window function can be computed
by solving (Serra, 1989)
$$c_w \sum_{n=-\infty}^{\infty} w(n) = c_w \sum_{m=0}^{N-1} w(m) = 1$$
Figure 3.8: Magnitude (top) and phase (bottom) spectra corresponding to a frame of the STFT. The locations of peaks corresponding to harmonic components are denoted with ×. A zoom into the dashed box in the phase spectrum is shown in Figure 3.9.
Figure 3.9: A detail of the phase spectrum in an STFT frame. Zero-phase windowing yields flat portions of the phase spectrum in the vicinity of a harmonic component.
Figure 3.10: Additive synthesis of the sinusoidal signal components, after (McAulay and Quatieri, 1986). Linear interpolation is used for the amplitude envelope and cubic interpolation for the instantaneous phase of each partial.
which yields
$$c_w = \frac{1}{\sum_{m=0}^{N-1} w(m)}. \qquad (3.3)$$
Furthermore, in the DFT half of the energy of each sinusoid lies at the negative
frequencies, and thus the amplitude value of the detected peak has to be multiplied
by a factor of 2. The overall normalization factor is thus
$$c = \frac{2}{\sum_{m=0}^{N-1} w(m)}$$
where $w(m)$ is the window function of length $N$.
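The overall factor can be computed directly from the window samples; a one-line sketch:

```python
def peak_normalization_factor(window):
    """Overall amplitude normalization c = 2 / sum(w(m)).

    Compensates both for the window gain and for the half of each
    sinusoid's energy that lies at the negative frequencies.
    """
    return 2.0 / sum(window)
```

For a rectangular window of length $N$, for instance, the factor reduces to $2/N$.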
3.4.7 Additive Synthesis of Sinusoidal Components
The additive synthesis of the sinusoidal signal components is pictured in Figure
3.10. In this case a phase-included additive synthesis is used, i.e., the signal is
approximated as
$$x(n) \approx \tilde{x}(n) = \sum_{k=1}^{N_{\mathrm{sig}}(n)} \tilde{A}_k(n) \cos(\tilde{\theta}_k(n)) \qquad (3.4)$$
where $\tilde{A}_k(n)$ is the amplitude envelope and $\tilde{\theta}_k(n)$ is the instantaneous phase of the
$k$th signal component. Notice that the number of signal components $N_{\mathrm{sig}}(n)$ may
depend on time $n$. This implies that the number of signal components adapts to
the analyzed signal.
The analysis stage provides the amplitude, the frequency, and the phase trajectories of the signal components. The values of each triplet $\{A_k^l, \omega_k^l, \theta_k^l\}$ correspond
to the detected values of amplitude, frequency, and phase of the $k$th signal component at frame $l$. They are separated in time by an amount determined by the hop
size parameter $N_{\mathrm{hop}}$ of the STFT. The trajectories have to be interpolated from
frame to frame in order to obtain the amplitude envelopes and the instantaneous
phases for additive synthesis. The amplitude trajectory $A_k^l$ of the $k$th signal component
is interpolated linearly from frame $l-1$ to frame $l$ to obtain the instantaneous amplitude
$$\tilde{A}_k(m) = A_k^{l-1} + \frac{A_k^l - A_k^{l-1}}{N_{\mathrm{hop}}}\, m, \qquad m = 0, 1, \ldots, N_{\mathrm{hop}} - 1 \qquad (3.5)$$
This procedure is applied to all frame boundaries to obtain the amplitude envelopes
$\tilde{A}_k(n)$ for the additive synthesis.
Both the detected frequency and phase affect the instantaneous phase $\tilde{\theta}_k(m)$.
Thus there are four variables, namely, $\omega_k^{l-1}$, $\theta_k^{l-1}$, $\omega_k^l$, and $\theta_k^l$, that have to be
involved in the interpolation. As proposed by McAulay and Quatieri (1986), cubic
interpolation can be used with
$$\tilde{\theta}_k(m) = \zeta + \gamma m + \alpha m^2 + \beta m^3. \qquad (3.6)$$
This equation is solved as
$$\tilde{\theta}_k(m) = \theta_k^{l-1} + \omega_k^{l-1} m + \alpha m^2 + \beta m^3 \qquad (3.7)$$
where
$$\begin{bmatrix} \alpha(M) \\ \beta(M) \end{bmatrix} =
\begin{bmatrix} \dfrac{3}{N_{\mathrm{hop}}^2} & \dfrac{-1}{N_{\mathrm{hop}}} \\[2mm] \dfrac{-2}{N_{\mathrm{hop}}^3} & \dfrac{1}{N_{\mathrm{hop}}^2} \end{bmatrix}
\begin{bmatrix} \theta_k^l - \theta_k^{l-1} - \omega_k^{l-1} N_{\mathrm{hop}} + 2\pi M \\[1mm] \omega_k^l - \omega_k^{l-1} \end{bmatrix} \qquad (3.8)$$
The value of $M$ is chosen so that the instantaneous phase function is maximally
smooth. This is done by taking $M$ to be the integer value closest to $x$, where
(McAulay and Quatieri, 1986)
$$x = \frac{1}{2\pi} \left[ \theta_k^{l-1} + \omega_k^{l-1} N_{\mathrm{hop}} - \theta_k^l + \frac{N_{\mathrm{hop}}}{2} \left( \omega_k^l - \omega_k^{l-1} \right) \right] \qquad (3.9)$$
The instantaneous phase $\tilde{\theta}_k(m)$ is obtained by applying Equation 3.7 to all frame
boundaries. The synthetic signal is now computed as
$$x_{\mathrm{sin}}(n) = \sum_{k=1}^{N_{\mathrm{sin}}(n)} \tilde{A}_k(n) \cos(\tilde{\theta}_k(n)). \qquad (3.10)$$
The residual signal corresponding to the stochastic component (Serra, 1989) is
obtained as
$$x_{\mathrm{res}}(n) = x(n) - x_{\mathrm{sin}}(n). \qquad (3.11)$$
The stochastic signal contains information on both the steady-state noise and rapid
transients in the signal.
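Equations 3.5-3.9 can be sketched for a single partial crossing one frame boundary as follows. This is a simplified illustration with invented variable names, not code from the report:

```python
import math

def synth_frame(a0, a1, w0, w1, th0, th1, n_hop):
    """Synthesize one partial across one frame boundary (McAulay-Quatieri).

    a0, a1   : amplitudes at frames l-1 and l
    w0, w1   : frequencies (rad/sample) at frames l-1 and l
    th0, th1 : phases (rad) at frames l-1 and l
    Returns n_hop samples using linear amplitude interpolation
    (Eq. 3.5) and cubic phase interpolation (Eqs. 3.6-3.9).
    """
    T = n_hop
    # unwrapping integer M for a maximally smooth phase (Eq. 3.9)
    x = (th0 + w0 * T - th1 + (T / 2.0) * (w1 - w0)) / (2.0 * math.pi)
    M = round(x)
    # cubic coefficients (Eq. 3.8)
    d = th1 - th0 - w0 * T + 2.0 * math.pi * M
    alpha = 3.0 / T**2 * d - (w1 - w0) / T
    beta = -2.0 / T**3 * d + (w1 - w0) / T**2
    out = []
    for m in range(n_hop):
        amp = a0 + (a1 - a0) * m / T                       # Eq. 3.5
        phase = th0 + w0 * m + alpha * m**2 + beta * m**3  # Eq. 3.7
        out.append(amp * math.cos(phase))
    return out
```

With these coefficients the cubic phase polynomial meets the measured phase (modulo $2\pi$) and frequency at both frame boundaries, which is what makes the overlap between frames click-free.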
3.5 Spectral Modeling Synthesis
The Spectral Modeling Synthesis (SMS) technique was developed in the late 1980s
at CCRMA, Stanford University. Serra (1989) developed a method for decomposing
a sound signal into deterministic and stochastic components. The deterministic component can be obtained by using the MQ algorithm (McAulay and Quatieri, 1986)
(Section 3.4) or by using a magnitude-only analysis. The deterministic part is subtracted from the original signal either in the time or the frequency domain to produce
a residual signal which corresponds to the stochastic component. The residual signal can be represented efficiently using methods discussed in this section. In (Serra
and Smith, 1990) a detailed discussion of the magnitude-only analysis/synthesis is
given, and a description of that system will be given here. The method is also discussed in (Serra, 1997b). The analysis scheme with phase included can be used to
obtain the residual signal by a time-domain subtraction, as discussed in Section 3.4.
This method is used in various musical analysis applications, including analysis of
recorded plucked string tones to derive proper excitation signals for physical modeling of plucked string tones (Tolonen and Välimäki, 1997). The interested reader is
also referred to work by Evangelista (1993, 1994), where a wavelet representation is
introduced that is suitable for representing separately the pseudo-periodic and aperiodic
components of a signal.
The SMS technique is based on the assumption that the input sound can be
represented as a sum of two signal components, namely, the deterministic and the
stochastic component. By definition, a deterministic signal is any signal that is fully
predictable. The SMS model, however, restricts the deterministic part to sinusoidal
components with piecewise linear amplitude and frequency variations. This affects
the generality of the model, and some sounds cannot be accurately modeled by
the technique. In the method, the stochastic component is described by its power
spectral density. Therefore, it is not necessary to preserve the phase information of the
stochastic component. The stochastic component can be efficiently represented by
the magnitude spectrum envelope of the residual of each DFT frame.
The SMS model consists of an analysis part and a synthesis part described in
the following two subsections.
3.5.1 SMS Analysis
The analysis part is used to map the input signal from the time domain into the
representation domain, as is depicted in Figure 3.11. The stochastic representation is
given by the spectral envelopes of the stochastic component of the input signal. The
envelopes are calculated from each DFT frame, and they can be efficiently described
using a piecewise linear approximation (Serra and Smith, 1990). The deterministic
representation is composed of two trajectories, the frequency and the magnitude
trajectory.
The analysis part is fairly similar to that of the MQ algorithm. The first step is
to calculate the STFT of each windowed portion of the signal. The STFT produces a
series of complex spectra from which the magnitude spectra are calculated. From each
spectrum the prominent peaks are detected and the peak trajectories are obtained
utilizing a peak continuation algorithm.
Figure 3.11: The analysis part of the SMS technique, after (Serra and Smith, 1990).

The stochastic component is obtained by subtracting the deterministic component from the signal in the frequency domain. First, the deterministic waveform is
computed from the peak trajectories. Then the STFT of the deterministic waveform
is calculated similarly to the one obtained from the original signal. By calculating
the difference of the magnitude spectra of the input and the deterministic signal,
the corresponding magnitude spectrum of the stochastic component is obtained for
each windowed waveform portion. The envelopes of these spectra are then approximated using a line-segment approximation. These envelopes form the stochastic
representation.
3.5.2 SMS Synthesis
The synthesis part of the technique maps a signal from the representation domain
into the time domain. This process is illustrated in Figure 3.12. The deterministic
component of the signal is obtained by a magnitude-only additive synthesis. An
optional transformation can be used to alter the synthesized signal. This allows
the production of new sounds using the information of the analyzed signal, for
example, the duration of the signal (tempo) can be varied without changing the
peak frequencies (key) of the signal. Similarly, the frequencies can be transposed
without influencing the duration.
The stochastic signal is computed from the spectral envelopes, or their modifications, by calculating an inverse STFT. The phase spectra are generated using a
random number generator.
The SMS method is very efficient in reducing the control data and computational
demands. The method is general and can be applied to many sounds. There are
some problems, however: the STFT is not sufficiently well time-localized, and short
transient signals will be spread in the time domain (Goodwin and Vetterli, 1996).
In the next section, a method for improving the accuracy on transient signals is
presented.
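The stochastic-synthesis step described above (magnitude envelope plus random phases, followed by an inverse transform) can be sketched as follows. This is an illustrative toy, assuming a single frame and using a direct inverse DFT where a real system would use an inverse FFT with overlap-add:

```python
import cmath
import random

def stochastic_frame(mag_envelope, seed=0):
    """Synthesize one frame of the stochastic component (sketch).

    The magnitude spectrum is taken from the (possibly modified)
    envelope over bins 0..N/2, the phase spectrum is drawn from a
    random number generator, and the time-domain frame is obtained by
    an inverse DFT of the Hermitian-symmetric spectrum.
    """
    rng = random.Random(seed)
    half = len(mag_envelope) - 1          # N/2
    n = 2 * half                          # frame length N
    # build a Hermitian-symmetric spectrum so the output is real
    spec = [0j] * n
    spec[0] = complex(mag_envelope[0], 0.0)
    spec[half] = complex(mag_envelope[half], 0.0)
    for k in range(1, half):
        phi = rng.uniform(-cmath.pi, cmath.pi)
        spec[k] = mag_envelope[k] * cmath.exp(1j * phi)
        spec[n - k] = spec[k].conjugate()
    # direct inverse DFT (an FFT would be used in practice)
    frame = []
    for m in range(n):
        s = sum(spec[k] * cmath.exp(2j * cmath.pi * k * m / n)
                for k in range(n))
        frame.append(s.real / n)
    return frame
```

Because only the magnitude envelope is kept, each synthesis run (different seed) produces a different waveform with the same spectral shape, which is exactly the point of the stochastic representation.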
Figure 3.12: The synthesis part of the SMS technique, after (Serra and Smith, 1990).
3.6 Transient Modeling Synthesis
An extension to Spectral Modeling Synthesis discussed in the previous section is
presented by Verma et al. (1997). In this approach, the residual signal obtained
by subtracting the sinusoidal model from the original signal is represented in two
parts, transients and steady noisy components. Transient Modeling Synthesis (TMS)
provides a parametric representation of the transient components.
TMS is based on the duality between the time and the frequency domains (Verma
et al., 1997). Transient signals are impulsive in the time domain, and thus they are
not in a form that is easily parameterizable. However, with a suitable transformation, impulsive signals are presented as frequency-domain signals that have a
sinusoidal character. This implies that sinusoidal modeling can be applied in the
frequency domain to obtain a parametric representation of the impulsive signal.
In the next subsection, the principles utilized in TMS are presented. Then, the
structure of the TMS system is described.
3.6.1 Transient Modeling with Unitary Transforms
The idea is to apply sinusoidal modeling to a frequency-domain signal that corresponds to rapid changes in the time-domain signal. For sinusoidal modeling we wish
to have a real-valued signal. Thus, in this case the DFT is not an appropriate choice
since it produces a complex-valued spectrum. The discrete cosine transform (DCT)
provides a mapping in which an impulse in the time domain maps into a real-valued
sinusoid in the frequency domain. The DCT is defined as
$$C(k) = \alpha(k) \sum_{n=0}^{N-1} x(n) \cos\!\left( \frac{(2n+1)\pi k}{2N} \right), \qquad n, k \in \{0, 1, 2, \ldots, N-1\} \qquad (3.12)$$
where $N$ is the length of the transformed signal $x(n)$. The coefficients $\alpha(k)$ are
$$\alpha(k) = \begin{cases} \sqrt{1/N} & \text{for } k = 0 \\ \sqrt{2/N} & \text{for } k = 1, 2, \ldots, N-1 \end{cases} \qquad (3.13)$$
It is obvious from Equation 3.12 that if $x(n) = \delta(n - l)$, the frequency-domain
representation is
$$C(k) = \alpha(k) \cos\!\left( \frac{(2l+1)\pi k}{2N} \right)$$
i.e., it is a sinusoid with a period depending on the location $l$ of the time-domain
impulse $\delta(n - l)$. Thus, Equation 3.12 implies that impulsive time-domain signals,
e.g., those corresponding to attacks of tones, produce a DCT that has strong sinusoidal
components, whereas steady-state signals produce a DCT with little or no sinusoidal
components.
Equation 3.12 is exemplified in Figures 3.13 and 3.14. On the top and in the
middle of Figure 3.13, an impulse-like time-domain signal and its DCT are illustrated, respectively. The DCT is clearly a sinusoid with an amplitude envelope that
varies with frequency. The waveform of the DCT can be represented by applying
sinusoidal modeling. Notice that in this case the sinusoidal analysis is performed
on a frequency-domain signal. On the bottom of Figure 3.13, the magnitude of the
complex-valued DFT computed on the sinusoidal DCT is presented. Notice that
only values corresponding to the positive indices of the DFT are shown. This plot
shows that the period of the DCT corresponds to the location of the impulse.
To demonstrate the duality principle applied in TMS, similar plots corresponding
to a slowly-varying signal are presented in Figure 3.14. In this case, an exponentially
decaying sinusoid (top) produces an impulsive DCT (middle). Again, the magnitude
of the DFT (bottom) computed on the DCT closely follows the amplitude envelope
of the original signal. Observe that in both Figures 3.13 and 3.14 the DFT does
not provide a parametric representation of the transients in the residual signal. The
magnitude plots are only shown to clarify the unitary transforms applied in the
TMS.
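The duality claim is easy to check numerically. The snippet below implements Equations 3.12 and 3.13 directly (a straightforward transcription, not optimized code) and transforms a shifted impulse:

```python
import math

def dct(x):
    """Orthonormal DCT as defined in Eqs. 3.12 and 3.13."""
    n = len(x)
    out = []
    for k in range(n):
        a = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(a * sum(x[m] * math.cos((2 * m + 1) * math.pi * k / (2 * n))
                           for m in range(n)))
    return out

# An impulse at position l produces a sampled cosine in the DCT domain,
# with the oscillation rate determined by l.
n, l = 64, 10
impulse = [1.0 if m == l else 0.0 for m in range(n)]
C = dct(impulse)
```

Since this DCT is orthonormal, the transform also preserves energy, so the unit impulse yields a DCT with unit energy.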
3.6.2 TMS System
As mentioned above, TMS is an extension to the SMS system discussed in Section 3.5
in that the residual signal is further decomposed into two components corresponding
to transient and noisy parts of the original signal. In this context, only the extension
part of the TMS is presented. A block diagram of the system is illustrated in Figure
3.15 (Verma et al., 1997). First, a block DCT is computed on the residual signal.
The length of the DCT block is chosen to be sufficiently large so that the transients
are compact entities within the block. A block size of one second has been found to be
a good choice (Verma et al., 1997). The transient detection block is optional, and it
can be used to determine the regions of interest in the sinusoidal analysis. The SMS
is applied to the frequency domain DCT signal, and the obtained representation is
used to synthesize the transients and subtract them from the residual signal in the
time domain. The residual signal is now expressed as components corresponding to
slowly-varying noise and transients. The analysis steps are elaborated further in the
following discussion.
The transient detection block is optional, and the system can operate without it.
However, it is useful: if the approximate locations of the transients in the time
Figure 3.13: An example of TMS. An impulsive signal (top) is analyzed. A DCT (middle) is computed, and a DFT (magnitude at the bottom) is performed on the DCT representation.
Figure 3.14: An example of TMS. A slowly-varying signal (top) is analyzed. A DCT (middle) is computed, and a DFT (magnitude at the bottom) is performed on the DCT representation.
Figure 3.15: A block diagram of the transient modeling part of the TMS system, after (Verma et al., 1997).
domain are known, the sinusoidal modeling operating on the DCT can be restricted
to only select those components that correspond to the transient positions. The
transients are detected in the residual signal by computing a ratio of the energies
of the residual and sinusoidal signals as a function of time (Verma et al., 1997).
In practice this is done within a DCT block by first computing the energies of the
sinusoidal and residual signals as
$$E_{\mathrm{sin}} = \sum_{n=0}^{N-1} |x_{\mathrm{sin}}(n)|^2 \quad \text{and} \quad E_{\mathrm{res}} = \sum_{n=0}^{N-1} |x_{\mathrm{res}}(n)|^2 \qquad (3.14)$$
where N is the length of the DCT. The instantaneous energies of the sinusoidal
and the residual signal are approximated by computing the energy within a short
window that is slid in time within the DCT block. This is expressed as
$$e_{\mathrm{sin}}(k) = \sum_{n=k-L/2}^{k+L/2} |x_{\mathrm{sin}}(n)|^2 \qquad (3.15)$$
and
$$e_{\mathrm{res}}(k) = \sum_{n=k-L/2}^{k+L/2} |x_{\mathrm{res}}(n)|^2 \qquad (3.16)$$
for $k = 0, N_{\mathrm{hop}}, 2N_{\mathrm{hop}}, \ldots, N-1$, where $L$ is the length of the sliding window, $N_{\mathrm{hop}}$
is the hop size parameter, and $x(n)$ is the signal within the DCT block, zero-padded
in such a manner that it is defined in the region of computation.
The locations of the transients are determined to be in the vicinity of positions
$k$ where the ratio of the normalized instantaneous energies of the residual and the sinusoidal signal is above a given threshold value. This is expressed explicitly as
$$\frac{e_{\mathrm{res}}(k)/E_{\mathrm{res}}}{e_{\mathrm{sin}}(k)/E_{\mathrm{sin}}} > R_{\mathrm{thr}} \qquad (3.17)$$
After the locations of the transients have been detected, the sinusoidal model
is restricted to estimating periodic spectral components corresponding to the estimated locations. If the transient detection is not used, SMS is applied on the whole
period range of the spectral representation. The spectral modeling parameters are
used to resynthesize the transient signal components, which are then subtracted from the
residual signal. The obtained signal lacks the rapid variations and can therefore be
approximated as slowly-varying noise.
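The energy-ratio test of Equations 3.14-3.17 can be sketched as below. The function name and the edge handling at the block boundaries are choices made here for illustration, not details from the report:

```python
def detect_transients(x_sin, x_res, win_len, hop, r_thr):
    """Flag candidate transient positions via the energy ratio (Eq. 3.17).

    Returns the window positions k where the normalized short-time
    residual energy exceeds the normalized short-time sinusoidal
    energy by more than the factor r_thr.
    """
    e_sin_tot = sum(v * v for v in x_sin) or 1e-12
    e_res_tot = sum(v * v for v in x_res) or 1e-12

    def short_energy(sig, k):
        # short-time energy in a window of length win_len centered at k
        lo, hi = k - win_len // 2, k + win_len // 2
        return sum(sig[i] * sig[i]
                   for i in range(max(0, lo), min(len(sig), hi + 1)))

    hits = []
    for k in range(0, len(x_res), hop):
        e_res = short_energy(x_res, k) / e_res_tot
        e_sin = short_energy(x_sin, k) / e_sin_tot or 1e-12
        if e_res / e_sin > r_thr:
            hits.append(k)
    return hits
```

With a sinusoidal part of roughly uniform energy and a short residual burst, only the window positions around the burst exceed the threshold.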
3.7 Inverse FFT (FFT⁻¹) Synthesis
Inverse FFT (FFT⁻¹) synthesis is presented in (Rodet and Depalle, 1992a) and
(Rodet and Depalle, 1992b). In this method, additive synthesis is performed in the
frequency domain, i.e., all the signal components are added together as spectral
envelopes composing a series of STFT frames. The waveform can be constructed
by calculating the inverse FFT of each frame. The overlap-add method is used to
attach the consecutive frames to each other.
Sinusoidal signals are simple to represent in the frequency domain. A windowed
sinusoid in the frequency domain is a scaled and shifted version of the DFT of the
window function. For the synthesis method to be computationally efficient, the
DFT of the windowing function should have low sidelobes, i.e., it should have few
significant values (Rodet and Depalle, 1992a). On the other hand, the frequency
and the amplitude of the sinusoid in consecutive frames are linearly interpolated.
This requirement yields a triangular window. The DFT of a triangular window,
however, has quite significant sidelobes and is not appropriate. A solution to this
problem is to use two windows, one in the frequency domain and one in the time
domain (Rodet and Depalle, 1992a).
Using FFT⁻¹ synthesis, quasiperiodic signals can easily be composed. The
parameters, namely, the frequency and the amplitude, are intuitive, although it is
useful to apply higher-level controls in order to efficiently create complex sounds
with many partials. It is straightforward to add noise of arbitrary shape in
the frequency-domain representation. This is done by adding STFTs of the desired
noise to the frequency-domain representation of the signal under construction (Rodet
and Depalle, 1992a).
Several methods exist for alleviating the problems, which arise mainly from the
interpolation between consecutive frames. These are discussed in (Goodwin and
Rodet, 1994) and (Goodwin and Gogol, 1995).
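The final reconstruction step of the method, attaching consecutive inverse-FFT frames by overlap-add, can be sketched as follows. The frequency-domain assembly of window-transform templates is omitted; the frames below are assumed to be already windowed time-domain buffers:

```python
def overlap_add(frames, hop):
    """Assemble time-domain frames produced by inverse FFTs (sketch).

    Each frame is assumed to be already windowed; consecutive frames
    are attached with the overlap-add method at hop-size spacing.
    """
    n = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + n)
    for i, frame in enumerate(frames):
        for j, v in enumerate(frame):
            out[i * hop + j] += v
    return out
```

With a triangular window of length 4 and hop size 2, for instance, the overlapping window tails sum to a constant in the interior of the output, which is the property that makes the frame-to-frame interpolation seamless.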
3.8 Formant Synthesis
In many cases it is useful to inspect spectral envelopes, i.e., a more general view of
the spectra instead of the fine details provided by the Fourier transform. A central
concept of spectral envelopes is the formant, which corresponds to a peak in the
envelope of the magnitude spectrum. A formant is thus a concentration of energy
in the spectrum. It is defined by its center frequency, bandwidth, amplitude, and
envelope. Formants are useful for describing many musical instrument sounds, but
they have been used most extensively for synthesis of speech and singing. See (Roads,
1995) for more details and references on the use of formants in sound synthesis.
In this section two sound synthesis methods based on formants are discussed.
Formant Wave-Function synthesis is used in the CHANT program to produce high
quality synthetic singing. VOSIM is a method for creating synthetic sound by trains
of pulses of a simple waveform. These methods can also be interpreted as granular
synthesis methods, as both of them use short grains of sound to produce the output
signal. See Section 2.6 for a discussion and references on granular synthesis methods.
3.8.1 Formant Wave-Function Synthesis and CHANT
The formant wave-function synthesis method was developed at IRCAM, Paris, France
(Rodet, 1980). The method starts from the premise that the production mechanism
of many real-world sound signals can be presented as an excitation function
and a filter (Rodet, 1980). The method assumes that the filter is linear and that the
excitation signal is composed of trains of impulses or arches. The fundamental frequency of the tone is then readily determined as the period of the train of excitation
pulses. In general, the response of the filter can be interpreted as a sum of responses
of a set of parallel filters, each of which corresponds to a formant in the synthesized waveform. The impulse responses of the parallel filters can be determined by
analyzing one period of a recorded signal by linear prediction (Rodet, 1980).
The main elements of formant wave-function synthesis are the formant wave-functions (French: fonction d'onde formantique, FOF) described by Rodet (1980).
Each FOF corresponds to a formant or a main mode of the synthesized signal, and it
is obtained by analyzing a recorded signal as explained above. FOFs are computed
in the time domain. A typical FOF $s(k)$ is pictured in Figure 3.16, and it can be
written as
$$s(k) = \begin{cases} 0 & \text{for } k < 0 \\ \frac{1}{2}\left[1 - \cos(\beta k)\right] e^{-\alpha k} \sin(\omega k + \phi) & \text{for } 0 \le k \le \pi/\beta \\ e^{-\alpha k} \sin(\omega k + \phi) & \text{for } k > \pi/\beta \end{cases} \qquad (3.18)$$
Figure 3.16: A typical FOF.
where $\omega$ is the center frequency, $\alpha$ is the 3 dB bandwidth, parameter $\beta$ governs
the skirt width, and $\phi$ is the initial phase. Naturally, the amplitude of the FOF can
also be modified.
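Equation 3.18 translates directly into a short generator. This is a sketch under the assumption that $\omega$ is given in radians per sample and $\alpha$, $\beta$ in per-sample units (the report does not fix units here):

```python
import math

def fof(n_samples, omega, alpha, beta, phi=0.0):
    """Generate a formant wave-function s(k) per Eq. 3.18 (sketch).

    omega : center frequency (rad/sample)
    alpha : exponential decay, related to the 3 dB bandwidth
    beta  : attack parameter; the attack lasts pi/beta samples
    phi   : initial phase
    """
    out = []
    for k in range(n_samples):
        s = math.exp(-alpha * k) * math.sin(omega * k + phi)
        if k <= math.pi / beta:
            # raised-cosine attack over the first pi/beta samples
            s *= 0.5 * (1.0 - math.cos(beta * k))
        out.append(s)
    return out
```

A FOF synthesizer then sums several such generators in parallel, one per formant, retriggered at the fundamental period.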
A FOF synthesizer is constructed by connecting FOF generators in parallel. The
synthesizer can be controlled via instructions from the CHANT program. The user
can utilize high-level commands and achieve comprehensive control without having
to adjust the low-level parameters directly.
The CHANT program was originally written to produce high-quality singing
voices, but it can also be employed to synthesize musical instruments (Rodet et
al., 1984). It employs semiautomatic analysis of the spectrum of recorded sounds,
extraction of gross formant characteristics, and fundamental-frequency estimation
(Rodet, 1980). The program is discussed in detail in (Rodet et al., 1984). Sound
examples of synthesized singing can be found in (Bennett and Rodet, 1989).
3.8.2 VOSIM
VOSIM (VOice SIMulation) was developed by Kaegi and Tempelaars (1978). It starts
from the idea of presenting a sound signal as a set of tone bursts that have variable
duration and delay.
The pulses used in VOSIM have a fixed waveform. The VOSIM time function
consists of N pulses that are shaped like squared sinusoids. The pulses are of
equal duration T with decreasing amplitude (starting from value A) and are followed
by a delay M. Each pulse is obtained from the previous pulse by multiplying with
a constant factor b. Such a time function is pictured in Figure 3.17. The five
parameters presented above are the primary parameters of VOSIM. For vibrato,
frequency modulation, and noise sounds the delay M is modulated. Three more
parameters are required: S is the choice of random or sine wave, D is the maximum
deviation of M , and NM is the modulation rate. Four additional variables allow
for transitional sounds: NP is the number of transition periods, and DT , DM , and
DA the positive or negative increments of T , M , and A, respectively.
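One period of the time function built from the five primary parameters can be sketched as follows (an illustrative helper, with the modulation and transition parameters left out):

```python
import math

def vosim(n_pulses, pulse_len, amp, delay, b):
    """Generate one period of the VOSIM time function (sketch).

    n_pulses pulses shaped like squared sinusoids, each pulse_len
    samples long, starting at amplitude amp and decaying by the
    factor b from pulse to pulse, followed by delay samples of
    silence.
    """
    out = []
    a = amp
    for _ in range(n_pulses):
        for n in range(pulse_len):
            out.append(a * math.sin(math.pi * n / pulse_len) ** 2)
        a *= b
    out.extend([0.0] * delay)
    return out
```

Repeating this period determines the fundamental frequency, while the pulse duration T places the formant-like energy concentration in the spectrum.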
Figure 3.17: The VOSIM time function. N = 11, b = 0.9, A = 1, M = 0, and T = 10 ms.
4 Physical Models
Physical modeling of musical instruments has evolved into one of the most active
fields in sound synthesis, musical acoustics, and computer music research. Physical
modeling applications gain popularity by giving users better tools for controlling
and producing both traditional and new synthesized sounds. The user is provided
with a sense of a real instrument.
The aim of a model is to simulate the fundamental physical behavior of an actual
instrument. This is done by employing the knowledge of the physical laws that
govern the motions and interactions within the system under study, and expressing
them as mathematical formulae and equations. These mathematical relationships
provide the tools for physical modeling.
There are two main motivations for developing physics-based models. The first
is that of science, i.e., models are used to gain understanding of physical phenomena.
The other is the production of synthesized sound. From the days of the first physics-based
models, researchers and engineers have utilized them for sound synthesis purposes
(Hiller and Ruiz, 1971a).
Physical modeling methods can be divided into five categories (Välimäki and
Takala, 1996):

1. Numerical solving of partial differential equations
2. Source-filter modeling
3. Vibrating mass-spring networks
4. Modal synthesis
5. Waveguide synthesis
Waveguide synthesis is one of the most widely used physics-based sound synthesis methods in use today. It is very efficient in simulating wave propagation in
one-dimensional homogeneous vibratory systems. The method is very much digital
signal processing oriented, and a number of real-time implementations using waveguide synthesis exist. Waveguide synthesis and single delay loop (SDL) models are
discussed further in Chapter 5.
Source-filter models have been used especially for modeling the human sound
production mechanism. The interaction of the vocal cords and the vocal tract is
modeled as a feedforward system. Effective digital filtering techniques for source-filter modeling have been developed especially for speech transmission purposes.
The technique is basically a physical modeling interpretation of the source-filter
synthesis presented in Section 3.3.
The modeling methods simulate the system either in the time or the frequency
domain. The frequency-domain methods are very effective for models of linear
systems. Musical instruments, however, cannot in general be approximated accurately as
being linear, and nonlinear systems make the frequency-domain approach infeasible.
All the methods presented here model the system under study in the time domain.
This chapter starts by describing three physical modeling methods that use
numerical acoustics. First, models using finite difference methods are presented.
Applications to string instruments as well as to mallet percussion instruments are
discussed. Second, modal synthesis is discussed. Third, CORDIS, a system for
modeling vibrating objects by mass-spring networks, is described.
The interested reader is also referred to an interesting web site by De Poli and
Rocchesso: http://www.dei.unipd.it/english/csc/papers/dproc/dproc.html
4.1 Numerical Solving of the Wave Equation
In this section, modeling methods based on finite difference equations will be discussed. These methods have been used especially for string instruments.

The method is in general applicable to any vibrating object, i.e., a string, a
bar, a membrane, a sphere, etc. (Hiller and Ruiz, 1971a). The basic principle is
to obtain mathematical equations that describe the vibratory motion in the object
under study. These wave equations are then solved in a finite set of points in
the object, thus obtaining a difference equation. The use of difference equations
leads to a recurrence equation that can be interpreted as a simulation of the wave
propagation in the vibrating object. The finite difference method is computationally
most efficient with one-dimensional vibrators, as the computational demands rapidly
increase with the introduction of more dimensions. The number of points in space
increases in proportion to the grid size raised to the power of the number of dimensions. Furthermore, the
number of computational operations for each point is increased, and the effective
sampling frequency is increased. Digital waveguide meshes, presented in Section 5.2,
are DSP formulations of difference equations in two and three dimensions.
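The recurrence-equation idea can be illustrated with the simplest possible case: the ideal, lossless, stiffness-free string discretized at the stability limit (Courant number 1). This is only a sketch of the principle; the loss and stiffness terms of the full Ruiz and Chaigne models are omitted:

```python
def fd_string_step(y_prev, y_curr):
    """One time step of an ideal lossless string (sketch).

    Uses the finite difference recurrence obtained from the 1-D wave
    equation at the stability limit:
        y(m, n+1) = y(m+1, n) + y(m-1, n) - y(m, n-1)
    with fixed (zero-displacement) ends.
    """
    n = len(y_curr)
    y_next = [0.0] * n
    for m in range(1, n - 1):
        y_next[m] = y_curr[m + 1] + y_curr[m - 1] - y_prev[m]
    return y_next
```

Iterating this step from an initial displacement shape simulates the wave propagation along the string; reading the displacement at one point over time gives the output signal.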
Hiller and Ruiz (1971a) were the first to take the approach of solving the differential equations of a vibrating string for sound synthesis purposes. They developed
models of plucked, struck, and bowed strings. The stiffness of the string was modeled, as well as the frequency-dependent losses. Hiller and Ruiz (1971b) were able
to produce synthesized sound and plots of the resulting waveforms with a computer
program. Since that pioneering work, developments have been made in modeling the
excitation, e.g., the interaction of the hammer and the piano strings, see (Chaigne
and Askenfelt, 1994a) for references.
More recently, Chaigne has studied finite difference methods with applications to modeling of the guitar, the piano, and the violin (Chaigne et al., 1990),
(Chaigne and Askenfelt, 1994a), (Chaigne and Askenfelt, 1994b). He has taken
similar approaches to modeling a vibrating bar with application to the xylophone
(Chaigne and Doutaut, 1997).
In this section, a difference equation with initial and boundary conditions for a
damped stiff string is first introduced. Then a similar treatment is given for a vibrating
bar. Finally, the synthesized waveforms are compared with recordings of
real instrument sounds.
4.1.1 Damped Stiff String
The model of a vibrating string presented here includes the stiffness of
the string as well as the frequency-dependent losses due to air friction,
viscosity, and the finite mass of the string. It describes the transversal wave motion of
the string in a plane. The wave equation for the model is (Chaigne and Askenfelt,
1994a)
\[
\frac{\partial^2 y}{\partial t^2} = c^2\,\frac{\partial^2 y}{\partial x^2} - \varepsilon c^2 L^2\,\frac{\partial^4 y}{\partial x^4} - 2b_1\,\frac{\partial y}{\partial t} + 2b_3\,\frac{\partial^3 y}{\partial t^3} + f(x, x_0, t)
\tag{4.1}
\]
where y is the displacement of the string, x the coordinate along the string, c the transversal
wave velocity on the string, ε the stiffness parameter, L the string length, b₁ and b₃
the loss parameters, and f(x, x₀, t) the excitation acceleration applied at point x₀.
The excitation term is actually a force density term normalized by the mass density
of the string, so that it gives the acceleration of the string at point x.
The stiffness parameter ε is given by
\[
\varepsilon = \kappa^2\,\frac{ES}{TL^2}
\tag{4.2}
\]
where κ is the radius of gyration of the string, E the Young's modulus, S the area
of the string cross section, and T the string tension.
In Equation 4.1, the two partial time derivative terms of odd order model the
frequency-dependent losses, i.e., the decay of the vibration. The decay is an effect of
several physical phenomena. The effect of each phenomenon can be hard to separate
from the total decay, and this is not attempted here. However, some qualitative
interpretations can be made. In the low-frequency range, the main causes of losses
are the air resistance and the resistive impedances at the ends of the string (Chaigne,
1992). In the high-frequency range, the damping is mainly created by internal losses
in the string, such as the viscoelastic losses in nylon strings discussed by Chaigne
(1991). The loss parameters b₁ and b₃ are obtained via the analysis of
real instrument tones. The model does not try to model separately the individual physical
processes that cause the dissipation of energy. The frequency-dependent
decay rate is given by
\[
\sigma = \frac{1}{\tau} = b_1 + b_3\,\omega^2.
\tag{4.3}
\]
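The quadratic decay law of Eq. 4.3 makes this analysis step concrete: decay rates measured at two partial frequencies determine b₁ and b₃ by solving a two-by-two linear system. A minimal sketch, with made-up measured values:

```python
# Fit the loss parameters b1 and b3 of the decay model of Eq. 4.3,
# sigma(omega) = b1 + b3 * omega**2, from decay rates measured at two
# partial frequencies. The "measured" values below are invented.
from math import pi

def fit_loss_parameters(f_low, sigma_low, f_high, sigma_high):
    """Solve b1 + b3*w^2 = sigma at two angular frequencies w = 2*pi*f."""
    w1, w2 = 2.0 * pi * f_low, 2.0 * pi * f_high
    b3 = (sigma_high - sigma_low) / (w2 ** 2 - w1 ** 2)
    b1 = sigma_low - b3 * w1 ** 2
    return b1, b3

# Hypothetical decay rates (1/s) measured for a low and a high partial:
b1, b3 = fit_loss_parameters(100.0, 1.2, 1000.0, 3.0)
# The model then predicts the decay rate of any other partial:
sigma_mid = b1 + b3 * (2.0 * pi * 500.0) ** 2
```

With more than two measured partials the same model would be fitted in the least-squares sense instead.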
The string is excited by the force density term f(x, x₀, t). It is assumed that the
force density does not propagate along the string, so that the time and space dependence
can be separated to give
\[
f(x, x_0, t) = f_H(t)\, g(x, x_0).
\tag{4.4}
\]
The term g(x, x₀) can be understood as a spatial window which distributes the
excitation energy over the string. This window smooths the applied excitation, e.g.,
a hammer strike on a piano string, so that artifacts caused by discontinuities in the
solution are eliminated.
The force density term f_H(t) is related to the force F_H(t) exerted in the excitation
by
\[
f_H(t) = \frac{F_H(t)}{\mu \int_{x_0-\delta x}^{x_0+\delta x} g(x, x_0)\, dx}
\tag{4.5}
\]
where 2δx is the effective length of the string section interacting with the exciter,
and μ is the linear mass density of the string.
4.1.2 Difference Equation for the Damped Stiff String
The difference equation for the stiff damped string is obtained by discretizing
time and space by taking (Chaigne and Askenfelt, 1994a)
\[
x_k = k\,\Delta x, \qquad k \in [0,\; L/\Delta x]
\tag{4.6}
\]
and
\[
t_n = n\,\Delta t, \qquad n = 0, 1, 2, \ldots
\tag{4.7}
\]
The time step Δt and the spatial step Δx are related by
\[
\Delta t = r\,\frac{\Delta x}{c}.
\]
The condition r = 1 gives the exact solution with no numerical dispersion (Chaigne,
1992). However, r equals unity only in the case of an ideal string. For values
r < 1, numerical dispersion is present in the model. This is not discussed further
here; see (Chaigne, 1992) for more details. The main
variable of interest is the discretized transversal string displacement, denoted
y(k, n) = y(kΔx, nΔt) for convenience. The derivation of the difference equation
approximating Equation 4.1 is given by Hiller and Ruiz (1971a) and is not
repeated here. However, it should be noted that one further simplification is made
for the sake of computational efficiency. The third-order time derivative term in
Eq. 4.1 would yield the following approximation with time t_n = nΔt as the central point:
\[
\frac{\partial^3 y}{\partial t^3} \approx \frac{y(k, n+2) - 2y(k, n+1) + 2y(k, n-1) - y(k, n-2)}{2\,\Delta t^3}
\tag{4.8}
\]
Computing y(k, n+1) would then require the future value y(k, n+2), i.e., an implicit method would be needed.
This can be overcome by noticing that the
magnitude of the term 2b₃ ∂³y/∂t³ is relatively small, and by reducing the number of
time steps with the recurrence equation for the ideal string:
\[
y(k, n+1) = y(k+1, n) + y(k-1, n) - y(k, n-1).
\tag{4.9}
\]
Using this equation to simplify Eq. 4.8 does not increase the number of time or space
steps involved in the recurrence equation. The general recurrence equation is now
given by
\[
\begin{aligned}
y(k, n+1) = {} & a_1\, y(k, n) + a_2\, y(k, n-1)\\
& + a_3\, [y(k+1, n) + y(k-1, n)]\\
& + a_4\, [y(k+2, n) + y(k-2, n)]\\
& + a_5\, [y(k+1, n-1) + y(k-1, n-1) + y(k, n-2)]\\
& + \Delta t^2\, [N F_H(n)\, g(k, i_0)]/M_S
\end{aligned}
\tag{4.10}
\]
where the coefficients a₁ to a₅ are given by Equations 4.11.
\[
\begin{aligned}
a_1 &= (2 - 2r^2 + b_3/\Delta t - 6\varepsilon N^2 r^2)/D\\
a_2 &= (-1 + b_1 \Delta t + 2b_3/\Delta t)/D\\
a_3 &= r^2 (1 + 4\varepsilon N^2)/D\\
a_4 &= (b_3/\Delta t - \varepsilon N^2 r^2)/D\\
a_5 &= (-b_3/\Delta t)/D
\end{aligned}
\tag{4.11}
\]
where
\[
D = 1 + b_1 \Delta t + 2b_3/\Delta t \qquad \text{and} \qquad r = c\,\Delta t/\Delta x.
\]
Figure 4.1 shows how the displacement y(k, n+1) depends on previous values of
the displacement when Eq. 4.10 is used. This equation can be directly utilized to compute
the displacements of the chosen discrete points on the string.
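The recurrence can be transcribed directly into code. The sketch below is an illustration rather than Chaigne and Askenfelt's implementation: the parameter values are invented, the excitation is replaced by a crude initial triangular displacement, and the terminations are simplified to fixed ends with zero displacement assumed outside the string (the proper boundary treatment is the subject of Section 4.1.4):

```python
# Explicit finite difference scheme of Eq. 4.10 with the coefficients of
# Eq. 4.11 for a damped stiff string. All parameter values are illustrative.

def simulate_string(N=50, steps=200, r=0.8, eps=1e-5, b1=0.5, b3=1e-6, fs=44100.0):
    dt = 1.0 / fs
    D = 1.0 + b1 * dt + 2.0 * b3 / dt
    a1 = (2.0 - 2.0 * r ** 2 + b3 / dt - 6.0 * eps * N ** 2 * r ** 2) / D
    a2 = (-1.0 + b1 * dt + 2.0 * b3 / dt) / D
    a3 = r ** 2 * (1.0 + 4.0 * eps * N ** 2) / D
    a4 = (b3 / dt - eps * N ** 2 * r ** 2) / D
    a5 = (-b3 / dt) / D

    def at(y, k):  # zero displacement assumed outside the string (simplified BC)
        return y[k] if 0 <= k <= N else 0.0

    # Start from rest in a crude triangular "pluck" shape instead of a force term.
    shape = [min(k, N - k) / float(N) for k in range(N + 1)]
    y2, y1, y0 = list(shape), list(shape), list(shape)  # states at n-2, n-1, n
    out = []
    for _ in range(steps):
        yn = [0.0] * (N + 1)  # the ends stay fixed at zero
        for k in range(1, N):
            yn[k] = (a1 * y0[k] + a2 * y1[k]
                     + a3 * (at(y0, k + 1) + at(y0, k - 1))
                     + a4 * (at(y0, k + 2) + at(y0, k - 2))
                     + a5 * (at(y1, k + 1) + at(y1, k - 1) + y2[k]))
        y2, y1, y0 = y1, y0, yn
        out.append(y0[N // 2])  # observe the mid-point displacement
    return out

out = simulate_string()
```

A useful sanity check on the coefficients is that a₁ + a₂ + 2a₃ + 2a₄ + 3a₅ = 1, so a constant displacement field is preserved by the recurrence.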
4.1.3 The Initial Conditions for the Plucked and Struck String
The initial conditions are given for models of the guitar and the piano. The conditions are very dissimilar and correspond to excitation by either plucking or
striking the string.
Plucked String
Excitation by plucking is the simplest case, and its initial conditions are given directly
by Equation 4.5. The spatial window g(x, x₀) and the string section affected are
Figure 4.1: Dependence of the displacement y(k, n+1) on previous values of the
displacement, after (Chaigne, 1992).
determined mainly by the type of the pluck. The velocity of the pluck determines the
time distribution of the excitation. Naturally, these are not mutually independent.
It may be helpful to consider the plucking event as being mapped to a force density
distribution that can then be separated into space- and time-dependent parts.
A more detailed model of the plucking event, including the finger-string
interaction, is given by Chaigne (1992).
For the guitar, the initial condition is introduced by rewriting the last term on
the right-hand side of Eq. 4.10 as
\[
\Delta t^2\,\frac{N}{m_S}\,F(n)\,g(k, i_0)
\tag{4.12}
\]
where N is the number of points on the string, m_S is the mass of the string, F(n) is
the force applied by the finger or plectrum, and g(k, i₀) is the discretized spatial window
(Chaigne et al., 1990).
Struck String
For the development of an expression for the initial conditions of the piano, an
assumption of zero initial velocity and displacement of the string is made by Chaigne
and Askenfelt (1994a). This assumption is made only for simplicity in
discussing the initial condition; the model itself places no restrictions on the initial condition.
With the string at rest at t = 0 we have
\[
y(k, 0) = 0.
\]
One further assumption is needed for Equation 4.10 to be applicable to the first
three time steps, since the calculation involves the states of the string at three past time
steps. Thus y(k, 1) is estimated using a truncated Taylor series to obtain
\[
y(k, 1) = \frac{y(k+1, 0) + y(k-1, 0)}{2}.
\]
For the displacement of the hammer at time n = 1 we calculate
\[
\eta(1) = V_{H0}\,\Delta t
\]
where V_{H0} is the hammer velocity at t = 0, and for the force exerted by the hammer
\[
F_H(1) = K\,|\eta(1) - y(k_0, 0)|^p.
\tag{4.13}
\]
Note that the force term at n = 1 is computed using the initial values, i.e., a unit
delay is introduced in order for the force to be computable. Borin et al. (1997a)
propose a more elaborate method for eliminating delay-free loops in discrete-time
models. Interestingly, they also apply the method to modeling the hammer-string
interaction.
Continuing with the treatment of Chaigne and Askenfelt (1994a), the displacement y(k, 2) is computed using a simplified version of Eq. 4.10:
\[
y(k, 2) = y(k-1, 1) + y(k+1, 1) - y(k, 0) + \frac{\Delta t^2\, N F_H(1)}{M_S}.
\tag{4.14}
\]
Here the stiffness and damping terms are neglected in order to limit the space
and time dependence, i.e., no terms with n = 2 are included. For the hammer, the
displacement η(2) and force F_H(2) are computed by
\[
\eta(2) = 2\eta(1) - \eta(0) - \frac{\Delta t^2\, F_H(1)}{M_H}, \qquad
F_H(2) = K\,|\eta(2) - y(k_0, 2)|^p.
\tag{4.15}
\]
The effect of the simplifications is discussed by Chaigne and Askenfelt (1994a).
After the displacements y(k, n) are known for the first three time samples, it is possible to start using the recurrence formula of Eq. 4.10 directly. The force F_H(n) is
assumed to be known, and its effect on the string is taken into account until the time
n when
\[
\eta(n+1) < y(k_0, n+1).
\]
After this the string is left to vibrate freely, unless recontact of the hammer is
modeled. The force density term f(x, x₀, t) can be applied to the string at any time.
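The hammer stepping scheme above can be condensed into a small loop. In this sketch the string point is held fixed at y = 0, a deliberate simplification so that only the hammer update of Eqs. 4.13-4.15 is exercised; K, p, and M_H are illustrative values rather than measured piano data:

```python
# Nonlinear hammer model in the spirit of Eqs. 4.13-4.15: a point mass M_H
# whose felt compression produces the contact force F_H = K|eta - y|^p while
# the hammer compresses the string. The string point is held fixed at y = 0
# (a deliberate simplification); K, p, M_H are illustrative values.

def hammer_contact(V0=2.0, K=1e8, p=2.3, M_H=0.005, fs=44100.0, y=0.0):
    dt = 1.0 / fs
    eta_prev = 0.0
    eta = V0 * dt              # eta(1) = V_H0 * dt, as in the text
    forces = []
    while eta > y and len(forces) < 10000:
        F = K * (eta - y) ** p
        forces.append(F)
        # eta(n+1) = 2 eta(n) - eta(n-1) - dt^2 F_H(n) / M_H
        eta, eta_prev = 2.0 * eta - eta_prev - dt * dt * F / M_H, eta
    return forces              # force history until the hammer leaves the string

forces = hammer_contact()
contact_ms = 1000.0 * len(forces) / 44100.0   # contact duration in milliseconds
```

The loop reproduces the qualitative behavior described above: the force rises as the felt compresses, peaks, and falls back to zero as the hammer rebounds and loses contact.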
4.1.4 Boundary Conditions for Strings in Musical Instruments
Terminations of strings in musical instruments are not completely rigid. For example, in the guitar the bridge has a finite impedance, and the finger terminating the
string against the fingerboard is far from rigid. The boundary conditions are given
for the guitar and the piano, and the case of the violin is discussed briefly.
The boundary conditions for plucked, bowed, and struck string instruments, such as
the guitar, the violin, and the piano, can be described by one of the three models
Figure 4.2: Models for boundary conditions of string instruments, after (Chaigne,
1992). VBC: violin-like boundary condition. GBC: guitar-like boundary condition.
PBC: piano-like boundary condition.
presented in Figure 4.2 (Chaigne, 1992). Here, point N on the string corresponds to
the position of the bridge. For the violin-like boundary condition it is assumed that
the displacement y(N, n) of the string is non-zero. Furthermore, the displacement
of the string at point k = N+1 is taken to be much smaller than at k = N.
If the distance between the bridge and the clamping position is greater than the
space step, e.g., pΔx, the boundary condition can be written as y(N+p, n) = 0
instead of y(N+1, n) = 0. Expressions for the intermediate points would then
have to be developed.
In the guitar-like boundary condition the string is clamped just behind the
bridge, so that the distance between the bridge and the clamping position is
small compared with the wavelengths of the audible partials. Thus, it can be assumed that
\[
y(N, n) = y(N+1, n),
\tag{4.16}
\]
i.e., y(N, n) denotes the displacement of the string as well as that of the resonating box at
the bridge. This allows the coupling of the bridge and the resonating body to be modeled using measured values of the input admittance at the guitar bridge.
Modeling the resonances and the radiated sound pressure is discussed by Chaigne
(1992).
The piano string is assumed to be hinged at both ends, yielding the following
boundary conditions (Fletcher and Rossing, 1991):
\[
y(0, t) = y(L, t) = 0, \qquad
\frac{\partial^2 y}{\partial x^2}(0, t) = \frac{\partial^2 y}{\partial x^2}(L, t) = 0.
\tag{4.17}
\]
For the model of the piano, the boundary conditions are obtained by discretizing
Eqs. 4.17, and they can be expressed as
\[
y(0, n) = y(N, n) = 0
\tag{4.18}
\]
\[
y(-1, n) = -y(1, n) \qquad \text{and} \qquad y(N+1, n) = -y(N-1, n).
\tag{4.19}
\]
The string is coupled to the soundboard at point N. If the frequency-dependent
properties of the coupling are desired, the second condition in Eq. 4.19 can be
replaced with a difference equation approximating the differential equation governing
the coupling. Equation 4.19 is important for deriving expressions for the string motion
at points k = -1 and k = N+1, since these points are not explicitly included in the
model. They are needed for the calculation of the displacement of the string
at points k = 1 and k = N-1, because the differential equation for the string is of
fourth order, i.e., the recurrence equation for point k depends on points k-2 and
k+2.
4.1.5 Vibrating Bars
An approach similar to the case of the string is taken by Chaigne and Doutaut (1997)
for the vibrating bar. A theoretical treatment of vibrating bars is given by, e.g., Morse
and Ingard (1968), and mallet percussion instruments are discussed by Fletcher and
Rossing (1991).
It is assumed that the vertical component w(x, t) of the displacement of a xylophone bar is given by the two following equations:
\[
M(x, t) = E I(x)\left(1 + \eta\,\frac{\partial}{\partial t}\right)\frac{\partial^2 w(x, t)}{\partial x^2}
\tag{4.20}
\]
and
\[
\frac{\partial^2 w(x, t)}{\partial t^2} = -\frac{1}{\rho S(x)}\,\frac{\partial^2 M(x, t)}{\partial x^2} - \gamma_B\,\frac{\partial w(x, t)}{\partial t} - \frac{\kappa}{M_B}\, w(x, t) + f(x, x_0, t).
\tag{4.21}
\]
Here M(x, t) is the bending moment and I(x) the moment of inertia about the x axis; S(x) is the
cross-sectional area of the bar; E is the Young's modulus and ρ the density of the
vibrating bar. The coefficients η and γ_B account for losses. They are obtained by
analyzing the decay times of partials of real instruments. An estimate of the stiffness
coefficient κ is obtained by measuring the natural frequency of a spring-mass system
composed of the bar, with mass M_B, and the supporting cord.
The model for the interaction between the bar and the mallet is similar to the
one used for the hammer-string interaction in the piano model, with force density
\[
f_H(t) = \frac{F_M(t)}{\rho S(x_0) \int_{x_0-\delta x}^{x_0+\delta x} g(x, x_0)\, dx}
\tag{4.22}
\]
where S(x₀) is the cross section of the bar at point x₀, and ρ is the density of the
bar. The spatial smoothing of the impact is obtained by employing a spatial window
as in Eq. 4.4.
The impact force is given by Eq. 4.13 with p = 3/2. Here the non-integer exponent
3/2 is derived from the general theory of elasticity, as opposed to the case of the
piano where analysis of experimental data must be used; see (Chaigne and Doutaut,
1997, Appendix A) for the derivation. The stiffness coefficient K is obtained by analysis
of experimental data.
This interaction model is able to simulate three important physical aspects of
the instrument:
1. The introduction of kinetic energy, localized in time and space, into the vibrating system.
2. The influence of the initial velocity on both the contact duration and the
impact force, due to the nonlinear force-deformation law. This determines the
spectrum of the tone.
3. The influence of the stiffness of the two materials in contact, which strongly
determines the tone quality of the initial blow, i.e., the attack.
These principles apply to the model of the piano as well.
For the numerical formulation of the xylophone model, the same principles as
in the case of the string are employed. However, the explicit computation scheme
already used in the guitar and piano models is applicable only to the simple case of a
uniform bar with constant cross-sectional area. This is the only model discussed in
this context; Chaigne and Doutaut (1997) also discuss the more demanding model
of a bar with a variable cross section.
The differential equation for the uniform bar is
\[
\frac{\partial^2 w(x, t)}{\partial t^2} = -a^2\left[\frac{\partial^4 w(x, t)}{\partial x^4} + \eta\,\frac{\partial^5 w(x, t)}{\partial x^4 \partial t}\right] - \gamma_B\,\frac{\partial w(x, t)}{\partial t} - \frac{\kappa}{M_B}\, w(x, t) + f(x, x_0, t)
\tag{4.23}
\]
where a² = EI/(ρS).
The recurrence equation approximating Eq. 4.23 is given by (Chaigne and
Doutaut, 1997)
\[
\begin{aligned}
w(k, n+1) = {} & c_1\, w(k, n) + c_2\, w(k, n-1)\\
& + c_3\, [w(k+2, n) - 4w(k+1, n) - 4w(k-1, n) + w(k-2, n)]\\
& + c_4\, [w(k+2, n-1) - 4w(k+1, n-1) - 4w(k-1, n-1) + w(k-2, n-1)]\\
& + c_5\, [F_M(n)\, g(k, i_0)]
\end{aligned}
\tag{4.24}
\]
where
\[
\begin{aligned}
c_1 &= \frac{2 - 6r^2(1 + \nu) - (\Delta t\,\omega_B)^2}{1 + \tilde{\gamma}}, \qquad
c_2 = \frac{-1 + \tilde{\gamma} + 6r^2\nu}{1 + \tilde{\gamma}},\\
c_3 &= \frac{-r^2(1 + \nu)}{1 + \tilde{\gamma}}, \qquad
c_4 = \frac{r^2\nu}{1 + \tilde{\gamma}}, \qquad
c_5 = \frac{N}{M_B f_s^2 (1 + \tilde{\gamma})}
\end{aligned}
\tag{4.25}
\]
where
\[
\nu = \eta f_s, \qquad \tilde{\gamma} = \frac{\gamma_B}{2 f_s}, \qquad \omega_B^2 = \frac{\kappa}{M_B}, \qquad \text{and} \qquad r = \frac{a N^2}{f_s L^2}.
\tag{4.26}
\]
It can be shown that the explicit scheme remains stable if the number of spatial
points N satisfies (Chaigne and Doutaut, 1997, Appendix B)
\[
N \le N_{MAX} = \sqrt{\frac{4}{3}\,\frac{f_s}{f_1\,(1 + \eta f_s)}}
\]
where f₁ is the frequency of the lowest partial. For wooden bars, the order of
magnitude of the term ηf_s is 10⁻². According to this stability
criterion, the maximum number of spatial points is roughly proportional to the square
root of the sampling frequency. Thus, doubling the spatial resolution requires a sampling
frequency four times the original. Furthermore, there is an asymptotic
limit √(4/(3ηf₁)) for N_MAX as f_s increases. The maximum
spatial resolution obtained with a sampling frequency of 192 kHz is about
1 cm.
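Reading the stability bound as N_MAX = sqrt((4/3) f_s / (f₁ (1 + η f_s))), both stated properties are easy to check numerically. The values of f₁ and η below are illustrative, chosen only so that ηf_s is of the stated order of magnitude at 44.1 kHz:

```python
# Numerical illustration of the stability bound
#   N_MAX = sqrt((4/3) * fs / (f1 * (1 + eta*fs))),
# with illustrative f1 and eta such that eta*fs is of order 1e-2.
from math import sqrt

def n_max(fs, f1, eta):
    return sqrt((4.0 / 3.0) * fs / (f1 * (1.0 + eta * fs)))

f1, eta = 700.0, 2.3e-7
n44 = n_max(44100.0, f1, eta)
n176 = n_max(4.0 * 44100.0, f1, eta)  # quadrupled sampling frequency
ratio = n176 / n44                    # close to 2: quadrupling fs doubles N_MAX
limit = sqrt(4.0 / (3.0 * eta * f1))  # asymptotic limit as fs grows
```

The ratio comes out slightly below 2 because the loss term ηf_s grows with f_s, which is exactly the effect that produces the asymptotic limit.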
A comparison of the original measured signals and those obtained with the model
of variable cross-sectional area is discussed in the next section.
4.1.6 Results: Comparison with Real Instrument Sounds
The models described in the previous sections have been evaluated and compared
to real instruments by Chaigne and Askenfelt (1994b), Chaigne et al. (1990), and
Chaigne and Doutaut (1997). This is important not only for the validation of the
models, but also for studying the contribution of each individual physical parameter
to the signal. Typically, the effect of a single parameter on the produced sound is
hard to establish by observing the instrument or the produced sound.
Only a short qualitative comparison between measured and simulated signals is
given in this section. References to detailed presentations of each instrument are
given in the corresponding subsection.
The Piano
Chaigne and Askenfelt (1994b) give a detailed and systematic discussion on the
comparison of real signals to those obtained by simulation.
The string velocities were computed for bass (C2), midrange (C4), and treble
(C7) tones. The overall result is that the model is capable of reproducing the waveforms quite well over the whole register of the piano, including the attack transients.
Some small discrepancies in the bass range may be caused by the non-rigid termination of a real piano string, a phenomenon that the model does not attempt to take
into account.
The spectra of the string velocities, with notes played in different ranges with
different dynamics, show good behavior of the model. The spectra show increased
spectral content with increased hammer velocity, as expected. Large and audible
differences were observed above the first 5-7 partials, although these discrepancies
had little effect on the waveforms.
The Guitar
The guitar and the corresponding finite difference model are compared by Chaigne
et al. (1990). The waveforms of a vibrating guitar string were obtained by a simple
electrodynamic method: a concentrated magnetic field was applied perpendicular
to the vibrating string, and the voltage generated between the string ends, proportional
to the string velocity at the point of the magnetic field, was measured.
It was observed that the measured and simulated waveforms were similar. Furthermore, the influence of the body response was more clearly visible in the measured
signal.
The Xylophone
The measurements and comparison were conducted by Chaigne and Doutaut (1997).
In this case the acceleration of the mallet's head was measured, and the corresponding
force signal was derived by multiplying by the equivalent mass of the mallet. The
acceleration of the chosen point on the bar was either measured with an
accelerometer or derived from the velocity signal obtained with a laser vibrometer.
Two different types of mallets were simulated: a soft mallet with a rubber head,
and a hard mallet with a boxwood head. For both mallets, signals of weak (piano)
and strong (mezzo-forte) impacts were measured and simulated. Three comparisons
were made: bar accelerations, impact forces, and bar acceleration spectra.
For a weak impact with a soft mallet, the general shape and amplitude of the
bar acceleration waveforms were similar. However, the upper partials seemed
to be damped more rapidly in the measured acceleration. For a strong impact, both
the magnitude and the shape of the signals were very similar. The model seems to
work better with hard mallets, because the bar acceleration waveforms show a good
match for both the weak and the strong impact.
The magnitudes of the impact forces on the bar showed that both the shapes and the amplitudes are reproduced fairly well for the soft mallet.
The impact durations were systematically shorter by approximately 20%. With
hard mallets the impact durations were identical, as were the shapes and magnitudes of the force signals.
The frequency-domain comparison of the bar acceleration signals again showed a
better match for the hard mallet. The first three partials were almost identical,
with discrepancies of 2 dB or less. With the soft mallet, the third
partial was approximately 15 dB below the corresponding partial of the measured
signal.
For a detailed comparison and discussion on the cause of the discrepancies, see
(Chaigne and Doutaut, 1997).
4.2 Modal Synthesis
The modal synthesis method has been developed mainly at IRCAM in Paris, France
(Adrien, 1989), (Adrien, 1991). A commercial software application, Modalys (Eckel et al., 1995), formerly called MOSAIC (Morrison and Adrien,
1993), has been produced. With this application the user can simulate vibrating structures: the user
describes the structure under study to the program, and the program computes the
modal data and outputs the signal observed at a point defined by the user.
Modal synthesis is based on the premise that any sound-producing object can
be represented as a set of vibrating substructures which are defined by modal data
(Adrien, 1991). Substructures are coupled, and they can respond to external excitations. These coupling connections also provide for the energy flow between
substructures. Typical substructures are:
- bodies and bridges
- bows
- acoustic tubes
- membranes and plates
- bells
The simulation algorithm uses the information of each substructure and their interactions.
The method is general, as it can be applied to structures of arbitrary complexity.
However, the computational effort increases rapidly with complexity, which sets the
practical limits of the method. Next, the formulation of the modal data of a substructure
is presented. Then an application to a real musical instrument is discussed shortly.
4.2.1 Modal Data of a Substructure
The modal data of a substructure consist of the frequencies and damping coefficients of the structure's resonant modes, and of the shape of each mode
(Adrien, 1991). A vibrating mode is essentially a particular motion in which every
point of the structure vibrates at the same frequency. Note that an
arbitrary motion of a structure can be expressed as a sum of contributions of
the modes, analogously to a Fourier series expansion.
The modes are excited by an external force applied at a given point on the
structure. The excitation energy is distributed to the modes depending on the form
of the excitation. It is assumed that no energy is exchanged between the
modes. In practice, the vibration pattern is never fully described by a single mode,
but is a sum of an infinite series of vibrating modes, corresponding to the infinite
number of degrees of freedom of a continuous structure. For the numerical computation
of the vibration of the structure to become realizable, the continuous structure must be
spatially divided into a finite set of points.
Given a set of N points on a structure, N modes can be represented.
Each mode m is described by its resonant frequency ω_m and damping coefficient ξ_m.
The N × N mode shape matrix [Φ_mk] describes the relative displacements of the
N points in each mode: column m of the mode shape matrix corresponds to the
contribution of mode m to the displacements of the N points. Each mode can then
be presented as a second-order resonator connected in parallel with the others, as
pictured in Fig. 4.3.
The modal data can be obtained analytically for simple vibrating structures: the
expressions for the modal data of each mode can be obtained from the differential
equation system governing the motion of the simple vibrating system. For complex
structures, direct computation of modal data is not possible, and analysis based on
measurements must be utilized. Modal analysis is used extensively in the aircraft and car industries, and thus efficient tools are available. They typically
consist of excitation and pickup devices, and signal processing hardware and software
for Fourier transforms and polynomial extraction of modal data. Similar methods
have been used for parameter calibration of other physical models (Välimäki et al.,
1996).
The method is similar for mechanical and acoustical systems. In mechanical
systems, the deflections in the mode shape matrix are the actual displacements of
the points on the surface of the vibrating structure. In acoustical systems, the elements of
the mode shape matrix correspond to deflections of sound pressure or particle
velocity.
4.2.2 Synthesis using Modal Data
Modal synthesis is very powerful in that all vibrating structures can be described
using the same equations. These equations describe the response of the structure to
an excitation applied at a given point. For a mechanical structure partitioned into N
Figure 4.3: A modal scheme for the guitar. A complex vibrating structure is
represented as a set of parallel second-order resonators responding to the external
force F and contributing to the resulting velocity v.
points, the equation for the instantaneous velocity of the kth point is (Adrien, 1991)
\[
\frac{\partial y_k^{t+1}}{\partial t} = \sum_{m=1}^{N} \Phi_{mk}\,
\frac{\displaystyle \sum_{l} \Phi_{ml}\, F_{l,t+1}^{ext} + \frac{1}{\Delta t}\,\frac{\partial \varphi_m}{\partial t}\bigg|_{t} - \omega_m^2\, \varphi_{m,t}}
{\dfrac{1}{\Delta t} + 2\,\omega_m \xi_m + \omega_m^2\, \Delta t}
\tag{4.27}
\]
where Φ_mk is the contribution of the mth mode to the deflection of point k on the
structure, F_{l,t+1}^{ext} is the instantaneous external force on point l of the structure, Δt is
the time step, and ω_m, ξ_m, and φ_m are the angular frequency, the damping coefficient,
and the instantaneous deflection associated with the mth mode.
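The per-mode update implied by Eq. 4.27 can be sketched as a resonator bank. The backward-difference discretization below is one consistent reading of the equation's denominator, and the modal data in the usage example are invented:

```python
# One time-step of a bank of modal resonators in the spirit of Eq. 4.27:
# external forces are projected onto each mode through the mode shapes,
# each mode advances as a damped second-order resonator, and the observed
# point velocity is the mode-shape weighted sum over modes.

def modal_step(phi, vel, forces, shapes, omegas, xis, dt):
    """Advance modal deflections phi and modal velocities vel by one step.
    shapes[m][l] is the deflection of point l in mode m."""
    for m in range(len(omegas)):
        drive = sum(shapes[m][l] * forces[l] for l in range(len(forces)))
        denom = 1.0 / dt + 2.0 * omegas[m] * xis[m] + omegas[m] ** 2 * dt
        vel[m] = (drive + vel[m] / dt - omegas[m] ** 2 * phi[m]) / denom
        phi[m] += dt * vel[m]

def point_velocity(vel, shapes, k):
    # Velocity observed at point k: weighted sum of the modal velocities.
    return sum(shapes[m][k] * vel[m] for m in range(len(vel)))
```

Feeding a one-sample force impulse into a single invented 100 Hz mode and stepping at 44.1 kHz produces a decaying sinusoidal velocity at the observation point, as expected of a damped resonator.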
A similar equation can be applied to acoustic systems, with the external forces
F_{l,t+1}^{ext} replaced by external flows U_{l,t+1}^{ext}. With the density of air denoted by ρ₀, the
equation becomes
\[
p_{k,t+1} = \rho_0\,\frac{\partial y_k^{t+1}}{\partial t} = \rho_0 \sum_{m=1}^{N} \Phi_{mk}\,
\frac{\displaystyle \sum_{l} \Phi_{ml}\, U_{l,t+1}^{ext} + \frac{1}{\Delta t}\,\frac{\partial \varphi_m}{\partial t}\bigg|_{t} - \omega_m^2\, \varphi_{m,t}}
{\dfrac{1}{\Delta t} + 2\,\omega_m \xi_m + \omega_m^2\, \Delta t}
\tag{4.28}
\]
If all instantaneous external excitations are known, the velocities of the modes,
and thus the velocities of all of the points, can be calculated. However, typically
only the excitations corresponding to control and driving data are known, and the other
forces or flows have to be determined. These forces or flows implement the coupling,
i.e., the energy flow, between substructures. The couplings are often nonlinear.
The reed/air-column interaction in woodwind instruments is an example of a coupling
of linear systems governed by Equations 4.27 and 4.28, respectively. The coupling
equation involves the flow entering the bore U₀^ext, the pressure difference between
the mouth and the bore p_m − p₀, the position of the reed ξ, the Backus constant
B, and the additional flow S₀ξ̇ due to the displacement of the reed. The interaction
shifts between two regimes (Adrien, 1991):
\[
\begin{aligned}
\text{Open reed:} \quad & U_{0,t+1}^{ext} = B\,(p_{m,t+1} - p_{0,t+1})^{3/2}\,\xi_{t+1}^{4/3} + S_0\,\dot{\xi}_{t+1}\\
\text{Closed reed:} \quad & U_{0,t+1}^{ext} = 0, \qquad \xi_{t+1} = 0
\end{aligned}
\tag{4.29}
\]
4.2.3 Application to an Acoustic System
When the modal synthesis method is applied to a simple acoustical system consisting
of a conical tube with a simple reed mouthpiece and five holes, six equations of the
form of Eq. 4.28 are utilized: one for the cone, and one for each hole. The reed is
represented as a mechanical system with nonlinear coupling to the cone through Equations
4.28 and 4.29. The interactions between substructures involve flow conservation.
Using this principle, it is possible to eliminate all pressure terms from the equations
and present them in a 6 × 6 matrix form. For a description of the matrix
equations, see (Adrien, 1991).
The modal synthesis method provides many possible output signals. It is interesting, at least for research purposes, to try to recreate the acoustic field of a real
instrument. This can be done by utilizing the body of a real instrument to radiate
the synthesized signal. In the case of the violin, the string, the bridge, and the
exciter are modeled as usual, but the body is replaced by an infinite impedance in the
model. The sound outputs obtained at the foot of the bridge are used as force signals
to drive shakers at the foot of a real instrument bridge. The implicit assumption
made here is that the body does not act as a load on the strings, and therefore
does not affect the attenuation and phase of the partials in the bow-string interaction (Rocchesso, 1998). Adrien (1991) gives a detailed discussion of the simulated
signals, but lacks a comparison with measured real-instrument signals.
4.3 Mass-Spring Networks: the CORDIS System
Cadoz et al. (1983) model the acoustical system under study using simple
ideal mechanical elements, such as masses, dampers, and springs. Their aim is to
develop a paradigm that can be applied to an arbitrary acoustic system. The
CORDIS system was the first system capable of producing sound based on a physical
model in real time (Florens and Cadoz, 1991). In this section, the basic elements
of the system are first described; then the application to modeling a plucked string is
discussed.
4.3.1 Elements of the CORDIS System
The most primitive and fundamental elements of the system are the following:
1. point masses
2. ideal springs
3. ideal dampers
When these components are combined and connected in sufficient number, reproduction of a spatial continuum and of an acoustical signal should be possible at a desired
sampling rate (Florens and Cadoz, 1991). The object under study is approximated as a set of these elements discretely distributed over its surface. A major
simplification is obtained by taking each element to be one-dimensional, i.e., each
element can only move or act in one dimension. For modeling interactions that vary
in time, e.g., bowing, striking with a hammer, or plucking, a conditional link is introduced. It consists of a spring and a damper with adjustable parameters, connected
in parallel.
The mathematical presentations of the elements are simple, and they are given
by Florens and Cadoz (1991) as
\[
\begin{aligned}
\text{Mass:} \quad & F = m\,\frac{\partial^2 x}{\partial t^2} & (4.30)\\
\text{Spring:} \quad & F_1 = F_2 = -K\,(x_1 - x_2) & (4.31)\\
\text{Damper:} \quad & F_1 = F_2 = -Z\left(\frac{\partial x_1}{\partial t} - \frac{\partial x_2}{\partial t}\right) & (4.32)\\
\text{Conditional link:} \quad & F_1 = F_2 = -K\,(x_1 - x_2) - Z\left(\frac{\partial x_1}{\partial t} - \frac{\partial x_2}{\partial t}\right) & (4.33)
\end{aligned}
\]
where F is the force driving the mass, F₁ and F₂ are the forces at the endpoints x₁ and x₂
of the spring, damper, or conditional link, Z is the friction coefficient, and K the
spring coefficient.
The same equations may be obtained in discretized form by taking
\[
\frac{\partial x(n)}{\partial t} \longrightarrow x(n) - x(n-1)
\]
and
\[
\frac{\partial^2 x(n)}{\partial t^2} \longrightarrow \frac{\partial x(n)}{\partial t} - \frac{\partial x(n-1)}{\partial t},
\]
thus
\[
\begin{aligned}
\text{Mass:} \quad & F(n) = m\,[x(n) - 2x(n-1) + x(n-2)] & (4.34)\\
\text{Spring:} \quad & F_1(n) = F_2(n) = -K\,[x_1(n) - x_2(n)] & (4.35)\\
\text{Damper:} \quad & F_1(n) = F_2(n) = -Z\,[x_1(n) - x_1(n-1) - x_2(n) + x_2(n-1)] & (4.36)\\
\text{Conditional link:} \quad & F_1(n) = F_2(n) = -K\,[x_1(n) - x_2(n)]\\
& \phantom{F_1(n) = F_2(n) = {}} - Z\,[x_1(n) - x_1(n-1) - x_2(n) + x_2(n-1)] & (4.37)
\end{aligned}
\]
Temporal discretization introduces an error in the frequency of each modeled
harmonic of the form
\[
\omega \longrightarrow \omega \left(1 + \frac{\omega^2 T^2}{24}\right).
\]
Since the sampling frequency f_s is the reciprocal of the time step T, the error can be
reduced by increasing the sampling frequency. Using a sampling frequency of three
times the frequency of the highest harmonic component guarantees a maximum error
of 5 percent on all partials.
The vibrating string is modeled with N masses connected by N − 1 identical
parallel spring-damper pairs, as illustrated in Figure 4.4. This chain of points
is connected at both ends to rigid supports, again through a damper and a
spring in parallel. In this case there will be N harmonics present in the signal.
Figure 4.4: A model of a string according to the CORDIS system. N point masses are connected to each other and to rigid end supports with a damper and a spring in parallel at each connection.
The creators of CORDIS have developed a system called ANIMA for two- and three-dimensional elements (Florens and Cadoz, 1991).
4.4 Comparison of the Methods Using Numerical Acoustics
In this section the three presented methods are compared by discussing the application of each method to an acoustic system: the guitar.
The use of finite difference equations for simulating the vibration of a string was presented in Section 4.1. The finite difference method is very accurate in reproducing the original waveform if the model parameters are correct. The approach is interesting especially in a scientific sense, because the vibratory motion can be observed at any discrete point on the string. Furthermore, the parameters of the model are the actual parameters of the real instrument, such as the stiffness and the loss parameters of the string, and the input admittance at the bridge. These parameters can be obtained via measurements and analysis on both the instrument and the signals produced by it.
For real-time sound synthesis purposes, the finite difference model is not very attractive. The model can only be applied to a simple structure, such as a vibrating string, in real time; including a guitar body in the model would imply the need for hybrid systems. An estimate of the computational complexity can be obtained by inspecting Equation 4.10. For each of the N points on a string, five multiplications and eight summations are needed, with an additional multiplication and summation if an excitation is applied at that point. For good spatial resolution, N needs to be large, so several hundreds of operations are needed for every output sample. Also, numerical dispersion might be a problem with the FD method (Rocchesso, 1998). An example program that simulates string vibration as well as the effect of each individual parameter has been written by Kurz and Feiten (1996). The program for Silicon Graphics workstations can be downloaded at ftp://ftp.kgw.tu-berlin.de/pub/vstring/.
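For comparison, the shape of the per-point computation can be sketched with a minimal explicit finite-difference update for an ideal string. Equation 4.10 is not reproduced in this section, so the scheme below omits the stiffness and loss terms it contains; the function name and interface are ours:

```python
def fd_string_step(y_now, y_prev, r2):
    """One explicit finite-difference time step for an ideal string with
    rigidly fixed ends. y_now and y_prev hold the displacements at times
    n and n-1; r2 = (c*T/X)**2 is the squared Courant number. The
    report's Eq. 4.10 additionally contains stiffness and loss terms,
    omitted here."""
    N = len(y_now)
    y_next = [0.0] * N                      # fixed terminations stay at zero
    for i in range(1, N - 1):
        y_next[i] = (2.0 * y_now[i] - y_prev[i]
                     + r2 * (y_now[i + 1] - 2.0 * y_now[i] + y_now[i - 1]))
    return y_next
```

With r2 = 1 this scheme propagates an initial shape exactly and repeats with period 2(N - 1) samples; the extra multiplications and summations quoted above come from the stiffness and loss terms of the full Equation 4.10.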
With modal synthesis the guitar can be divided into three substructures, one for every functional part of the instrument, namely, the excitation, the vibrating strings, and the body radiating the sound field. The excitation substructure only interacts with the other parts when the string is being plucked. The excitation can be applied at any point on the other two substructures. The vibrating string is simulated with N parallel independent second-order resonators, each producing one harmonic component. The resonators can be implemented efficiently, but a large number of them is needed for high-quality synthesis. The model for the body of the instrument is obtained by modal analysis of the structure. This is a very time-consuming process, especially for a complex structure, such as the violin (Adrien, 1991).
If a body of a real instrument is used as a transducer, the radiated sound field, produced by the vibrating string coupled to the body, can be simulated. Naturally, this approach can also be used with other methods capable of producing a driving force signal at the bridge.
The CORDIS system divides each vibrating structure into idealized elements, i.e., point masses vibrating in one direction, connected with ideal dampers and springs. A vibrating string is thus simulated with N point masses connected together with N - 1 links, each composed of a damper and a spring connected in parallel. This structure is capable of producing N harmonics. The number of computational operations for each cell is relatively low. An estimate can be made by analyzing Equations 4.34-4.36. One output sample requires approximately 3N multiplications and 6N summations. Unfortunately, an estimate of the number of points needed for the simulation of a guitar body was not available.
To summarize, several observations can be made. The finite difference method can be used for simulating vibrations of essentially one-dimensional objects very accurately. The other methods attempt to be more general at the cost of accuracy and of a detailed mathematical presentation of the vibratory phenomena. The finite difference method and the modal synthesis method provide tools for the study of real instruments. None of the methods is well suited to real-time sound synthesis purposes. The first reason is the computational cost when high-quality
synthesis is desired. Second, the parameters of the models are non-intuitive in a musical sense, and they are hard to control in the same way the actual instrument is controlled, especially in real-time performance situations. Finally, sound synthesis methods with more efficient computation and control exist, especially for string instruments and woodwinds with conical bores.
5. Digital Waveguides and Extended
Karplus-Strong Models
Digital waveguides and single delay loop (SDL) models are the most efficient methods for physics-based real-time modeling of musical instruments. High-quality models exist for a number of musical instruments, and research in this field is active.

In this chapter the digital waveguides are first discussed. Second, waveguide meshes, which are 2D and 3D models, are presented. The equivalence of the bidirectional digital waveguide model and the SDL model is detailed by Karjalainen et al. (1998), and it will be described briefly. The last sections of the chapter present a case study of modeling the acoustic guitar using commuted waveguide synthesis.
5.1 Digital Waveguides
The concept of digital waveguides has been developed by Smith (1987, 1992, 1997). Digital waveguides and methods based on finite differences are closely related in that they both start from the premise of solving the wave equation. We recall from Section 4.1 that with the finite difference method the wave equation is solved in a set of discrete points on the vibrating object. At every time step a physical variable, such as displacement, is computed for every point. This implies that the vibratory motion of the whole discretized vibrating object is readily observable. While this may be attractive for the study of the vibrating object and the vibratory motion, more efficient methods are needed for sound synthesis purposes.
5.1.1 Waveguide for Lossless Medium
The digital waveguide is based on a general solution of the wave equation in a one-dimensional homogeneous medium. The lossless wave equation for a vibrating string can be expressed as (Morse and Ingard, 1968)

$\varepsilon\,\dfrac{\partial^2 y}{\partial t^2} = K\,\dfrac{\partial^2 y}{\partial x^2}$ (5.1)

where K is the string tension, $\varepsilon$ the linear mass density, and y the displacement of the string. This equation is applicable to any lossless one-dimensional vibratory
motion, like that of the air column in the bore of a cylindrical woodwind instrument. Naturally, in that case the parameters and the wave variables are interpreted accordingly.

Figure 5.1: d'Alembert's solution of the wave equation: a superposition of the traveling-wave components $y_l(x - ct)$ and $y_r(x + ct)$.

It can be seen by direct computation that the equation is solved by an
arbitrary function of the form

$y(x, t) = y_l(x - ct) \quad \text{or} \quad y(x, t) = y_r(x + ct)$ (5.2)

where

$c = \sqrt{K/\varepsilon}.$
The functions $y_l(x - ct)$ and $y_r(x + ct)$ can be interpreted as traveling waves going left and right, respectively. The general solution of the wave equation is a linear combination of the two traveling waves, and it is pictured in Figure 5.1. This is d'Alembert's solution of the wave equation.

The only restriction posed by d'Alembert's solution is that the functions $y_l(x - ct)$ and $y_r(x + ct)$ have to be twice differentiable in both x and t. However,
when the linear wave equation is developed for the real one-dimensional vibrator,
the amplitude of the vibration is assumed to be small. Physically, in the case of
a vibrating string this means that the slope of the vibrating string can only have
values much lower than one. Similarly, vibrating air columns can exhibit only small
variations of pressure around the static air pressure.
The digital waveguide is a discretization of the functions $y_l(x - ct)$ and $y_r(x + ct)$, and it is obtained by first changing the variables

$x \rightarrow x_m = mX, \qquad t \rightarrow t_n = nT$

where T is the time step, X is the corresponding step in space, and m and n are the new integer-valued space and time variables. The new variables are related by

$c = X/T.$
Substitution of the new variables into d'Alembert's solution of the wave equation yields

$y(x_m, t_n) = y_r(t_n - x_m/c) + y_l(t_n + x_m/c)$
$\phantom{y(x_m, t_n)} = y_r(nT - mX/c) + y_l(nT + mX/c)$
$\phantom{y(x_m, t_n)} = y_r(T(n - m)) + y_l(T(n + m)).$ (5.3)

Figure 5.2: The one-dimensional digital waveguide, after (Smith, 1992). Two m-sample delay lines carry $y^+$ and $y^-$ between $x = 0$ and $x = mcT$, and the output $y(n, k)$ is formed at position $x = k$.
Equation 5.3 can be simplified by defining

$y^+(n) = y_r(nT) \quad \text{and} \quad y^-(n) = y_l(nT).$ (5.4)

In this notation, the + superscript denotes the traveling-wave component going to the right and the - superscript the component going to the left.
Finally, the mathematical description of the digital waveguide is obtained with the two discrete functions $y^+(n - m)$ and $y^-(n + m)$, which can be interpreted as m-sample delay lines. The delay lines are pictured in Figure 5.2. The output from the waveguide at point k is obtained by summing the delay-line variables at that point.
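This delay-line interpretation can be sketched directly in code. The snippet below implements the two m-sample delay lines of Figure 5.2 and the output sum; the rigid, phase-inverting terminations at both ends are an addition for the sketch (the figure itself shows only the lossless delay lines), and the function names are ours:

```python
def waveguide_output(right, left, k):
    """Physical output y(n, k): the sum of the two traveling-wave
    delay-line variables at position k (Figure 5.2)."""
    return right[k] + left[k]

def waveguide_step(right, left):
    """Advance the two m-sample delay lines by one time step. The lists
    are indexed by spatial position; `right` holds y+ (moving toward
    higher indices) and `left` holds y- (moving toward lower indices).
    Rigid, phase-inverting terminations at both ends are assumed here;
    they are not part of Figure 5.2."""
    out_far = right[-1]          # y+ sample reaching x = mcT
    out_near = left[0]           # y- sample reaching x = 0
    right[1:] = right[:-1]       # shift right-going wave one step right
    left[:-1] = left[1:]         # shift left-going wave one step left
    right[0] = -out_near         # reflection at x = 0
    left[-1] = -out_far          # reflection at x = mcT
```

A single pulse inserted into the right-going line returns to its starting position with its original sign after one round trip of 2m samples, since the two sign inversions cancel.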
The solution to the one-dimensional wave equation provided by the waveguide is exact at the discrete points in the lossless case, as long as the wavefronts are originally bandlimited to one half of the sampling rate. Bandlimited interpolation can be applied to estimate the values of the traveling waves at non-integral points of the delay lines. Fractional delay filters provide a convenient solution to bandlimited interpolation; see (Laakso et al., 1996) and (Välimäki, 1995) for more on fractional delay filters. A number of different physical quantities can be chosen as traveling waves. See Smith (1992, 1995) for details on conversion between wave variables.
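As a concrete illustration of fractional delay filtering, the snippet below computes the FIR coefficients of a Lagrange interpolator, one of the designs surveyed by Laakso et al. (1996); the function name and interface are ours:

```python
def lagrange_fd_coeffs(delay, order):
    """FIR coefficients h[0..order] approximating a fractional delay of
    `delay` samples by Lagrange interpolation (one of the designs
    surveyed by Laakso et al., 1996). The approximation is most accurate
    when `delay` is close to order/2."""
    h = []
    for n in range(order + 1):
        c = 1.0
        for k in range(order + 1):
            if k != n:
                c *= (delay - k) / (n - k)
        h.append(c)
    return h
```

For example, a half-sample delay with order 1 gives the two-tap linear interpolator [0.5, 0.5]; the coefficients of any order sum to one, so a constant signal passes through unchanged.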
5.1.2 Waveguide with Dispersion and Frequency-Dependent
Damping
In real vibrating objects, physical phenomena that account for attenuation of the
vibratory motion are always present. These phenomena have to be incorporated
in the model to obtain any realistic synthesis. In a general case, dispersion is also
present. A wave equation that includes both the frequency-dependent damping and
Figure 5.3: A lossy digital waveguide in (a). The frequency-dependent gains $G(\omega)$ are lumped before the observation points to obtain $G^k(\omega)$ in order to get a more efficient implementation. In (b), dispersion is added in the form of allpass filters approximating the desired phase delay, after (Smith, 1995).
the dispersion has already been presented for the vibrating string in Equation 4.1. The complete linear, time-invariant generalization of the wave equation for the lossy stiff string is described by Smith (1995).
A frequency-dependent gain factor $G(\omega)$ determines the frequency-dependent attenuation of the traveling wave over one time step. For a detailed derivation of an expression for $G(\omega)$ from the one-dimensional lossy wave equation, see (Smith, 1992, 1995). In the waveguide, a gain factor that realizes $G(\omega)$ would have to be inserted between every unit delay. However, the system is linear and time-invariant, and the gain factors can be commuted over every unobserved portion of the delay line. This is illustrated in Figure 5.3 (a), where the losses are consolidated before each observation point.
When the fourth-order spatial derivative of the displacement y is present in the wave equation, the velocity of the traveling waves is not constant but depends on the frequency. This is to say that the wavefront shape will be constantly evolving, as the higher frequency components travel with a different velocity than the lower frequency components. This physical phenomenon is present in every physical string, and it is called dispersion. The dispersion is mainly caused by the stiffness of the string. For a derivation of an expression for the frequency-dependent velocity, see Smith (1995).
The dispersion can be taken into account in the waveguide model by inserting
an allpass filter before each observation point, as is done in Figure 5.3 (b). The allpass filter $H_a(z)$ approximates the dispersion effect for a delay line of length a. Van Duyne and Smith (1994) present an efficient method for designing the allpass filter as a series of one-pole allpass filters. More recently, Rocchesso and Scalcon (1996) have presented a method, based on the allpass filter design technique of Lang and Laakso (1994), that designs the filter from an analysis of the dispersion in recorded sound signals.
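To illustrate how a cascade of first-order allpass sections produces a frequency-dependent (dispersive) delay, the sketch below evaluates the phase delay of such a cascade. It does not reproduce the coefficient designs of Van Duyne and Smith (1994) or Rocchesso and Scalcon (1996); the coefficient value and interface are illustrative:

```python
import cmath

def phase_delay(a, stages, w):
    """Phase delay in samples, at radian frequency w, of `stages`
    cascaded first-order allpass sections H(z) = (a + z^-1)/(1 + a*z^-1),
    with |a| < 1 for stability."""
    z = cmath.exp(1j * w)
    h = (a + 1.0 / z) / (1.0 + a / z)   # frequency response of one section
    # One section's phase lies in (-pi, 0] for 0 < w < pi, so no
    # unwrapping is needed before scaling by the number of sections.
    return -stages * cmath.phase(h) / w
```

With a = 0 each section is a pure unit delay; with a negative coefficient the low-frequency phase delay per section rises to (1 - a)/(1 + a) samples while high frequencies travel faster, which is the qualitative behavior used to mimic string dispersion.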
5.1.3 Applications of Waveguides
The digital waveguide has been applied to many sound synthesis problems (Smith, 1996). A short overview of applications in different instrument families is given below.

The first physics-based approach to use digital filters to model a musical instrument was made for the violin by Smith (1983). Jaffe and Smith (1983) introduced several extensions to the Karplus-Strong algorithm that enable high-quality synthesis of plucked strings, including an allpass filter in the delay loop to approximate the non-integral part of the delay.
Since those pioneering works, many improvements and further extensions have been presented for plucked string synthesis. These include Lagrange interpolation for fine-tuning the pitch and producing smooth glissandi (Karjalainen and Laine, 1991), and allpass filtering techniques to simulate dispersion caused by string stiffness (Smith, 1983), (Paladin and Rocchesso, 1992), and (Van Duyne and Smith, 1994). The commuted waveguide synthesis technique is an efficient way to include a high-quality model of an instrument body in waveguide synthesis. It has been proposed by Smith (1993) and Karjalainen et al. (1993). Välimäki et al. (1995) have presented a method to produce smooth glissandi with allpass fractional delay filters. A parameter calibration method based on the STFT was developed by Karjalainen et al. (1993) and further elaborated by Välimäki et al. (1996). A similar approach was also taken by Laroche and Jot (1992). These works were extended, and an automated calibration system was implemented, by Tolonen and Välimäki (1997). Multirate implementations of the string model and separate low-rate body resonators are presented by Smith (1993), Välimäki et al. (1996), and Välimäki and Tolonen (1997a, 1997b).
The plucked-string algorithm has also been utilized to synthesize electric instrument tones. Sullivan (1990) extended the Karplus-Strong algorithm to synthesize electric guitar tones with distortion and feedback. Rank and Kubin (1997) have developed a model for slap-bass synthesis.
Waveguide synthesis for the piano is presented by Smith and Van Duyne (1995) and Van Duyne and Smith (1995a), where a model of a piano hammer (Van Duyne and Smith, 1994), a 2D digital waveguide mesh (Van Duyne and Smith, 1993a, see also Section 5.2), and allpass filtering techniques for simulating the stiffness of the strings and the soundboard are combined. Another development of the piano hammer model is presented by Borin and Giovanni (1996).
Waveguide synthesis has also been applied to several wind instruments. The clarinet was one of the first applications, by Smith (1986), Hirschman (1991), Välimäki et al. (1992b), and Rocchesso and Turra (1993). A waveguide model for the flute has been proposed by Karjalainen and Laine (1991) and Välimäki et al. (1992a). Välimäki et al. (1993) propose a model for the finger holes in woodwind bores. Brass instrument tones have been simulated with waveguides by Cook (1991), Dietz and Amir (1995), Msallam et al. (1997), and Vergez and Rodet (1997). Cook (1992) has created a device that can control models of the wind instrument family.

SPASM is a DSP program by Cook (1993) for modeling the human sound production mechanism in real time. It also provides a graphical user interface with an image of the vocal tract shape.
5.2 Waveguide Meshes
The digital waveguide presented in the previous section is very efficient in modeling one-dimensional vibrators. If modeling of vibratory motion in a 2D or 3D object is desired, the digital waveguide can be expanded into a waveguide mesh. Applications of waveguide meshes can be found, for instance, in modeling membranes, soundboards, cymbals, gongs, and room acoustics.

In this section, the two-dimensional waveguide mesh is discussed. Different implementations of the 2D waveguide mesh are given by Van Duyne and Smith (1993a, 1993b), Fontana and Rocchesso (1995), and Savioja and Välimäki (1996, 1997). Expansion to a three-dimensional mesh is relatively straightforward, as is the expansion to the mathematically interesting N-dimensional mesh. An interesting 3D formulation not discussed here is the tetrahedral waveguide mesh presented by Van Duyne and Smith (1995b, 1996).
The two-dimensional wave equation and its traveling plane-wave solution are given as (Morse and Ingard, 1968)

$\dfrac{\partial^2 u(t, x, y)}{\partial t^2} = c^2 \left[ \dfrac{\partial^2 u(t, x, y)}{\partial x^2} + \dfrac{\partial^2 u(t, x, y)}{\partial y^2} \right]$ (5.5)

$u(t, x, y) = \displaystyle\int f(x \cos\theta + y \sin\theta - ct)\, d\theta$ (5.6)

where $\theta$ denotes the direction of the plane wave. The integral involves an infinite number of traveling waves that can be divided into components traveling in the x- and y-directions.
5.2.1 Scattering Junction Connecting N Waveguides
To be able to formulate a waveguide mesh, a junction of waveguides needs to be developed. The connection of waveguides is pictured in Figure 5.4, where the scattering junction S connects N bi-directional waveguides with impedances $R_i$, $i = 1, 2, \ldots, N$.
Figure 5.4: A scattering junction, after (Van Duyne and Smith, 1993b). N waveguides are connected together with no loss of energy.
For the connection to be physically meaningful, two conditions are required. First, the values of the wave variables, e.g., vibration velocities or sound pressures, have to be equal at the point of the junction:

$v_S = v_1 = v_2 = \ldots = v_N$ (5.7)

where $v_S$ is the value of the wave variable at the junction. Equation 5.7 states that the strings move together at all times. Second, the sum of the forces exerted by the strings, or of the flows in the tubes, must equal zero:

$\displaystyle\sum_{k=1}^{N} f_k = 0.$ (5.8)
Recalling from the previous section the definitions

$v_k = v_k^+ + v_k^-, \quad f_k = f_k^+ + f_k^-, \quad f_k^+ = R_k v_k^+, \quad \text{and} \quad f_k^- = -R_k v_k^-,$

the two constraints of Equations 5.7 and 5.8 can be developed further as

$\displaystyle\sum_{k=1}^{N} R_k v_k = \sum_{k=1}^{N} R_k v_k^+ + \sum_{k=1}^{N} R_k v_k^-$
$\displaystyle\phantom{\sum_{k=1}^{N} R_k v_k} = \sum_{k=1}^{N} R_k v_k^+ + \underbrace{\sum_{k=1}^{N} R_k v_k^+ - \sum_{k=1}^{N} R_k v_k^-}_{\text{equals } 0} + \sum_{k=1}^{N} R_k v_k^-$
$\displaystyle\phantom{\sum_{k=1}^{N} R_k v_k} = 2 \sum_{k=1}^{N} R_k v_k^+.$

Now, using $v_S = v_k$, an expression for the wave variable at the junction is obtained as

$v_S = \dfrac{2 \sum_{k=1}^{N} R_k v_k^+}{\sum_{k=1}^{N} R_k}.$ (5.9)

The outputs of the junction are obtained by applying $v_S = v_k = v_k^+ + v_k^-$ as

$v_k^- = v_S - v_k^+.$ (5.10)
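Equations 5.9 and 5.10 translate directly into code. The sketch below computes the junction variable and the outgoing waves for an N-port junction; the function name and interface are ours:

```python
def scattering_junction(v_plus, R):
    """N-port lossless scattering junction, Eqs. 5.9-5.10. `v_plus`
    holds the incoming wave variables v_k+ and `R` the port impedances
    R_k. Returns the junction variable vS and the outgoing waves v_k-."""
    vS = 2.0 * sum(r * v for r, v in zip(R, v_plus)) / sum(R)
    return vS, [vS - v for v in v_plus]
```

With equal impedances and a unit wave entering one of four ports, the junction returns vS = 0.5 and outgoing waves (-0.5, 0.5, 0.5, 0.5); the incoming and outgoing powers, proportional to the sums of $R_k (v_k^\pm)^2$, are equal, reflecting the losslessness of the junction.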
5.2.2 Two-Dimensional Waveguide Mesh
The rectilinear waveguide mesh formulation of the two-dimensional wave equation consists of delay elements and 4-port scattering junctions. Such a system is pictured in Figure 5.5. The scattering junctions are marked with $S_{l,m}$, where l denotes the index in the x-direction and m the index in the y-direction. The discrete time variable is n. The two delay elements between the ports of consecutive scattering junctions form a bi-directional delay unit.
If the medium is assumed isotropic, the impedances $R_k$ are equal, and the junction equations for junction $S_{l,m}$, denoted S for convenience, are obtained from Equations 5.9 and 5.10 as

$v_S(n) = \dfrac{1}{2} \displaystyle\sum_{k=1}^{4} v_k^+(n)$ (5.11)

and

$v_k^-(n) = v_S(n) - v_k^+(n), \qquad k = 1, 2, 3, 4.$ (5.12)

This formulation can be interpreted as a finite difference approximation of the two-dimensional wave equation, as shown by Van Duyne and Smith (1993a, 1993b).
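A complete rectilinear mesh update can be sketched from Equations 5.11 and 5.12. In the code below, each junction scatters its four incoming waves and passes the outgoing waves through the one-sample bidirectional delay units to its neighbors; the sign-flipping treatment of the edges is a crude rigid boundary assumed only for this sketch, and the data layout is ours:

```python
def mesh_step(inc):
    """One time step of a rectilinear 2D waveguide mesh (Eqs. 5.11-5.12).
    inc[d][y][x] is the wave arriving at junction (y, x) from direction d
    in 'NSEW'. Returns (v, new_inc), where v[y][x] is the junction
    signal vS(n). Waves leaving the grid are re-injected with a sign
    flip -- a crude rigid boundary, not part of the formulation."""
    H, W = len(inc['N']), len(inc['N'][0])
    v = [[0.5 * (inc['N'][y][x] + inc['S'][y][x]
                 + inc['E'][y][x] + inc['W'][y][x])
          for x in range(W)] for y in range(H)]
    # Outgoing wave toward direction d (Eq. 5.12): v_k- = vS - v_k+.
    out = {d: [[v[y][x] - inc[d][y][x] for x in range(W)]
               for y in range(H)] for d in 'NSEW'}
    new = {d: [[0.0] * W for _ in range(H)] for d in 'NSEW'}
    for y in range(H):
        for x in range(W):
            # A wave sent north arrives at the junction above from the south.
            if y > 0:
                new['S'][y - 1][x] = out['N'][y][x]
            else:
                new['N'][y][x] = -out['N'][y][x]
            if y < H - 1:
                new['N'][y + 1][x] = out['S'][y][x]
            else:
                new['S'][y][x] = -out['S'][y][x]
            if x > 0:
                new['E'][y][x - 1] = out['W'][y][x]
            else:
                new['W'][y][x] = -out['W'][y][x]
            if x < W - 1:
                new['W'][y][x + 1] = out['E'][y][x]
            else:
                new['E'][y][x] = -out['E'][y][x]
    return v, new
```

A unit wave injected at an interior junction spreads outward while the total energy (the sum of squared incoming waves) is conserved, since the equal-impedance scattering and the sign-flipping boundaries are both lossless.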
5.2.3 Analysis of Dispersion Error
The formulation of the waveguide mesh presented above has some drawbacks. The wave propagation speed and the magnitude response depend on both the direction of wave motion and the frequency. This can be illustrated by examining the two-dimensional discrete Fourier transform of the finite difference scheme. The 2D DFT produces a 2D frequency space such that each point $(\xi_1, \xi_2)$ corresponds to a spatial frequency

$\xi = \sqrt{\xi_1^2 + \xi_2^2}.$

The coordinates $\xi_1$ and $\xi_2$ of the 2D frequency space are taken to correspond to the x- and y-dimensions of the waveguide mesh, respectively.
The ratio of the actual propagation speed to the desired propagation speed in the rectilinear waveguide mesh can be computed as (Van Duyne and Smith, 1993a)

$\dfrac{c'(\xi_1, \xi_2)}{c} = \dfrac{\sqrt{2}}{\xi T} \arctan\!\left( \dfrac{\sqrt{4 - b^2}}{b} \right)$ (5.13)
Figure 5.5: Block diagram of a 2D waveguide mesh, after (Van Duyne and Smith, 1993a).
where

$b = \cos(\xi_1 T) + \cos(\xi_2 T)$ (5.14)

and T is the sampling interval.
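The speed ratio can be evaluated numerically. The sketch below assumes Equation 5.13 has the form $c'(\xi_1, \xi_2)/c = (\sqrt{2}/(\xi T))\,\arctan(\sqrt{4 - b^2}/b)$ with b from Equation 5.14; under this reading, the expression reduces exactly to 1 along the diagonal $\xi_1 = \xi_2$, matching the known dispersion-free behavior of the rectilinear mesh in the diagonal directions:

```python
import math

def speed_ratio(xi1, xi2, T=1.0):
    """Relative wave speed c'(xi1, xi2)/c of the rectilinear mesh,
    assuming Eq. 5.13 reads c'/c = (sqrt(2)/(xi*T)) * atan(sqrt(4-b^2)/b)
    with b = cos(xi1*T) + cos(xi2*T) (Eq. 5.14)."""
    b = math.cos(xi1 * T) + math.cos(xi2 * T)
    xi = math.hypot(xi1, xi2)
    if xi == 0.0:
        return 1.0                      # limit at DC: no dispersion
    return math.sqrt(2.0) / (xi * T) * math.atan2(math.sqrt(4.0 - b * b), b)
```

Along the axes the ratio drops with frequency (for example, to about 0.94 at $\xi_1 T = \pi/2$), consistent with the direction dependence plotted in Figure 5.6 (a).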
The effect of the dispersion error can be suppressed using different types of waveguide formulations. Savioja and Välimäki (1996, 1997) propose to use an interpolated waveguide mesh that utilizes deinterpolation to approximate unit delays in the diagonal directions. Fontana and Rocchesso (1995) suggest a tessellation of the ideal membrane into triangles. The ratio of the propagation speeds is pictured in Figure 5.6 as a function of frequency for four different types of waveguide formulations. In Figure 5.6 (a) (Van Duyne and Smith, 1993a), the speed ratio of the rectilinear formulation is pictured. In (b) and (c), the speed ratios of a hypothetical (non-realizable) 8-directional waveguide and a deinterpolated 8-directional waveguide are depicted (Savioja and Välimäki, 1997). In Figure 5.6 (d), the speed ratio of the triangular tessellation is illustrated (Fontana and Rocchesso, 1995).
The distance from the center of the plots in Figures 5.6 (a)-(d) corresponds to the
Figure 5.6: Dispersion in digital waveguides. The wave propagation speed is plotted as a function of spatial frequency and direction for a rectilinear mesh in (a), for a hypothetical 8-directional mesh in (b), for a deinterpolated 8-directional mesh in (c), and for a triangular tessellation in (d). The spatial frequency is the distance from the origin, and the $\xi_1 T$- and $\xi_2 T$-axes of the horizontal plane correspond to the x- and y-directions.
spatial frequency. The axes of the horizontal plane are the $\xi_1$- and $\xi_2$-axes, which correspond to the x- and y-directions in the mesh, respectively. The contours of equal ratios are pictured at the bottom of each figure. The dependence of the propagation speed ratio on both the frequency and the direction can be seen in (a). In the other figures, the dependence on the direction can be observed to be less severe. It should be noted that the mesh of (b) is not realizable.
5.3 Single Delay Loop Models
The first extensions to the Karplus-Strong algorithm presented in Section 2.3 were derived by Jaffe and Smith (1983). Even before that, Smith (1983) had developed a model for the violin that included a string model similar to the generic Karplus-Strong model in Figure 2.5 (b). Those works were the first to take a physical-modeling interpretation of the Karplus-Strong model.
The digital waveguide presented in the previous section can be developed into an SDL model¹ in certain situations. In this section, the SDL model is derived for the guitar, as has been done by Karjalainen et al. (1998). In this context, only the case of a string with a force-signal output at the bridge will be considered. This corresponds to the construction of a classical acoustic guitar. The case of pickup output, which corresponds to electric guitars, is presented by Karjalainen et al. (1998). We start with a continuous-time waveguide model in the Laplace domain and develop a discrete-time model which can be identified as an SDL model.
In the discussion to follow, the transfer functions of the model components are described in the Laplace transform domain. The Laplace transform is an efficient tool in linear continuous-time systems theory. In particular, time-domain integration and differentiation transform into division and multiplication by the Laplace variable s, respectively. The complex Laplace variable s may be replaced with $j\omega$ (where j is the imaginary unit $\sqrt{-1}$, $\omega$ is the radian frequency $\omega = 2\pi f$, and f is the frequency in Hz) in order to derive the corresponding representation in the Fourier transform domain, i.e., the frequency domain. For a discrete-time implementation, the continuous-time system is finally approximated by a discrete-time system in the z-transform domain. For more information on the Laplace, Fourier, and z-transforms, see a standard textbook on signal processing, such as (Oppenheim et al., 1983).
In the next subsection, a waveguide model for the acoustic guitar is presented.
In the one after that, the digital waveguide representation is developed into an SDL
model.
¹ In this document the models of a vibrating string consisting of a loop with a single delay line are called single delay loop (SDL) models to distinguish them from both the non-physical KS algorithm and the bidirectional digital waveguide models.
Figure 5.7: Dual delay-line waveguide model for a plucked string with a force output at the bridge (Karjalainen et al., 1998).
5.3.1 Waveguide Formulation of a Vibrating String
In Figure 5.7, a dual delay-line waveguide model for an ideally plucked acoustic guitar string with the transversal bridge force as an output is presented. The delay lines and the reflection filters $R_b(s)$ and $R_f(s)$ form a loop in which the waveforms circulate. The two reflection filters simulate the reflection of the waveform at the termination points of the vibrating part of the string, at the bridge and at the corresponding fret, respectively. The filters are phase-inverting, i.e., they have negative signs, and they also contain slight frequency-dependent damping. Let us assume for now that the delay lines correspond to d'Alembert's solution of the wave equation for a stiff and lossy string. In this case they are dispersive, and they also attenuate the signal continuously in a frequency-dependent manner.
The pluck excitation X(s) is divided into two parts $X_1(s)$ and $X_2(s)$, so that $X_1(s) = X_2(s) = X(s)/2$. The excitation parts are fed into the waveguides at points E1 and E2. It has been shown by Smith (1992) that an ideal pluck of the string can be approximated by a unit impulse if acceleration waves are used. Thus, it is attractive to choose acceleration as the wave variable; in this context, $A_1(s)$ and $A_2(s)$ correspond to the values of the right- and left-traveling acceleration waves at positions R1 and R2, respectively.
The output signal of interest is the transverse force F(s) applied at the bridge by the vibrating string. It is obtained from the acceleration wave components $A_1(s)$ and $A_2(s)$ as

$F(s) = F^+(s) + F^-(s) = Z(s)[V^+(s) - V^-(s)] = Z(s)\,\dfrac{1}{s}[A_1(s) - A_2(s)]$ (5.15)

i.e., the bridge force F(s) is the bridge impedance Z(s) times the difference of the string velocity components $V^+(s)$ and $V^-(s)$ at the bridge. In the last form of Equation 5.15, the velocity difference $V^+(s) - V^-(s)$ is expressed as the integrated acceleration difference $\frac{1}{s}[A_1(s) - A_2(s)]$.
Figure 5.7 also includes the transfer functions between the unobserved and unmodified parts of the waveguides; $H_{A,B}(s)$ refers to the transfer function from point A to point B. These transfer functions are elaborated in the following subsection, where the bi-directional digital waveguide model is reformulated as an SDL model.
5.3.2 Single Delay Loop Formulation of the Acoustic Guitar
In the waveguide formulation pictured in Figure 5.7, there are four points, namely E1, E2, R1, and R2, at which either a signal ($X_1(s)$ and $X_2(s)$) is fed into the waveguide or the wave variables ($A_1(s)$ and $A_2(s)$) are observed. It is immediately apparent that the formulation can be simplified by combining the transfer function $R_f(s)$ with the transfer functions $H_{E2,L2}(s)$ and $H_{L1,E1}(s)$ of the two parts of the lossy and dispersive waveguide to the left of the excitation points E1 and E2. However, it is more efficient to attempt to reduce the number of points at which the wave variables are processed or observed.

The explicit input to the lower delay line can be removed by deriving an equivalent single excitation at point E1 that corresponds to the net effect of the two excitation components at points E1 and E2. The equivalent single excitation at E1 can be expressed as²
$X_{E1,eq}(s) = X_1(s) + H_{E2,L2}(s) R_f(s) H_{L1,E1}(s) X_2(s)$
$\phantom{X_{E1,eq}(s)} = \frac{1}{2}[1 + H_{E2,E1}(s)]\, X(s)$
$\phantom{X_{E1,eq}(s)} = H_E(s) X(s)$ (5.16)

where $H_{E2,E1}(s)$ is the left-side transfer function from E2 to E1, consisting of the two parts of the lossy and dispersive delay lines, $H_{E2,L2}(s)$ and $H_{L1,E1}(s)$, and the reflection function $R_f(s)$. Thus, $H_E(s)$ is the equivalent excitation transfer function.
In a similar fashion, one of the explicit output points can be removed in order to obtain a structure with only single input and output positions. Since the guitar body is driven by the force applied by the vibrating string at the bridge, it is apparent that an acceleration-to-force transfer function is required. In Equation 5.15, the output force F(s) is expressed in terms of the acceleration waves $A_1(s)$ and $A_2(s)$. This is further elaborated as

$F(s) = Z(s)\,\dfrac{1}{s}[A_1(s) - A_2(s)]$
$\phantom{F(s)} = Z(s)\,\dfrac{1}{s}[A_1(s) - R_b(s) A_1(s)]$
$\phantom{F(s)} = Z(s)\,\dfrac{1}{s}[1 - R_b(s)] A_1(s)$
$\phantom{F(s)} = H_B(s) A_1(s)$ (5.17)

where $H_B(s)$ is the acceleration-to-force transfer function at the bridge. Notice that it depends only on $A_1(s)$, the wave variable of the upper delay line. In a similar fashion one can derive an expression for F(s) depending only on $A_2(s)$.
² `eq' in $X_{E1,eq}(s)$ stands for `equivalent'.
To develop an expression for $A_1(s)$ in terms of the equivalent input $X_{E1,eq}(s)$, we first write

$A_1(s) = H_{E1,R1}(s) X_{E1,eq}(s) + H_{loop}(s) A_1(s)$ (5.18)

where

$H_{loop}(s) = R_b(s) H_{R2,E2}(s) H_{E2,E1}(s) H_{E1,R1}(s)$ (5.19)

i.e., $H_{loop}(s)$ is the transfer function of one full circulation of the signal around the loop. Thus, the sum terms of Equation 5.18 correspond to the equivalent excitation signal $X_{E1,eq}(s)$ transferred to point R1 and the signal $A_1(s)$ transferred once around the loop. Solving Equation 5.18 for $A_1(s)$, we obtain

$A_1(s) = H_{E1,R1}(s)\,\dfrac{1}{1 - H_{loop}(s)}\, X_{E1,eq}(s) = H_{E1,R1}(s) S(s) X_{E1,eq}(s)$ (5.20)

where $S(s)$ is the string transfer function that represents the recursion around the string loop.
Finally, the overall transfer function from the excitation to the bridge output is written as

$H_{E,B}(s) = \dfrac{F(s)}{X(s)} = \dfrac{1}{2}[1 + H_{E2,E1}(s)]\, \dfrac{H_{E1,R1}(s)}{1 - H_{loop}(s)}\, Z(s)\,\dfrac{1}{s}[1 - R_b(s)]$ (5.21)

or, more compactly, based on the above notation,

$H_{E,B}(s) = H_E(s) H_{E1,R1}(s) S(s) H_B(s)$ (5.22)

which represents the cascaded contribution of each part of the physical string system.
At this point the continuous-time model of the acoustic guitar in the Laplace
transform domain is approximated with a discrete-time model in the z-transform
domain. This approximation is needed in order to make the model realizable in a
discrete-time form. We rewrite Equation 5.22 in the z-transform domain as

HE,B(z) = HE(z) HE1R1(z) S(z) HB(z)                         (5.23)

where

HE(z) = (1/2) [1 + HE2E1(z)]                                (5.23a)
S(z) = 1 / (1 - Hloop(z))                                   (5.23b)
HB(z) = Z(z) I(z) [1 - Rb(z)].                              (5.23c)

Filter I(z) is a discrete-time approximation of the time-domain integration operation.
Equation 5.23 is interpreted by examining the block diagram in Figure 5.8. It shows
qualitatively the delays and the discrete-time approximations of the cascaded filter
components in Equation 5.21.

[Figure 5.8: A block diagram of transfer function components as a model of the
plucked string with force output at the bridge (Karjalainen et al., 1998). The
blocks are: filtering due to excitation position, HE(z); wave propagation from
excitation to bridge, HE1R1(z); the string loop, S(z); and filtering due to
bridge coupling, HB(z).]

The first block, corresponding to HE(z), simulates the
comb filtering effect depending on the pluck position. Notice that the phase inversion
of the reflection filter is explicated with the multiplication by -1. The second block,
corresponding to the transfer function from E1 to R1 in Figure 5.7, has a minor effect
and is usually discarded in the final implementation of the model. This reduction
is justified by noticing that the gain term G(ω) determining the attenuation of the
traveling wave in one time step is extremely close to unity, and thus in the short
time it takes for the wave to travel from E1 to R1 the attenuation is negligible.
The third block in Figure 5.8 is the string loop, and it simulates the vibration of
the string. The delay in the loop corresponds in length to the sum of the two delay
lines in Figure 5.7. The losses of a single round trip in the loop are consolidated in the
lowpass filter. In the last block, the feedforward filter is typically discarded and
only the integrator is implemented. In this case the lowpass filter corresponds to
the opposite of the reflection filter at the bridge, and it is very close to unity. Thus, the
sum of the filtered and the direct signal is approximated as being equal to 2.
Notice that the model of the acoustic guitar presented in Figure 5.8 is indeed a
single delay loop model, with the only loop in the third block. The presented model
describes the vibration of a plucked string. It includes the effects of the plucking
position, and the output signal corresponds to the force applied by the string at
the guitar body in a physically relevant manner. In the next section this model is
extended to include models of the guitar body, the two vibration polarizations, and
sympathetic couplings between the strings.
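The single delay loop structure of Figure 5.8 can be illustrated with a short program. The sketch below is not the authors' exact DSP structure: it implements the pluck-position comb filter HE(z) and a string loop whose round-trip losses are consolidated in a one-pole lowpass filter, while the bridge filter HB(z) is omitted for brevity. All parameter values are assumptions chosen for illustration.

```python
import numpy as np

def sdl_pluck(excitation, delay, loop_gain=0.99, pluck_pos=0.13):
    """Sketch of the single-delay-loop plucked-string structure.

    HE(z) = (1/2)(1 - z^-M) simulates the pluck position; the string
    loop consists of a delay line of `delay` samples closed through a
    one-pole lowpass loss filter.  Values are illustrative assumptions.
    """
    exc = np.asarray(excitation, dtype=float)
    n = len(exc)
    # HE(z): comb filter whose delay M reflects the relative pluck position
    M = max(1, int(round(pluck_pos * 2 * delay)))
    x = 0.5 * exc
    x[M:] -= 0.5 * exc[:-M]
    # String loop: delay line with consolidated lowpass losses
    buf = np.zeros(delay)
    lp = 0.0
    y = np.zeros(n)
    for i in range(n):
        v = buf[i % delay]               # sample delayed by `delay` steps
        lp = loop_gain * 0.5 * (v + lp)  # one-pole lowpass loss filter
        y[i] = x[i] + lp                 # close the single delay loop
        buf[i % delay] = y[i]
    return y
```

With a unit impulse as excitation and a loop delay of roughly fs/f0 samples, the output is a decaying tone whose pitch is set by the loop delay, in the manner of the string loop S(z).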
5.4 Single Delay Loop Model with Commuted Body
Response
The sound production mechanism of the guitar can be divided into three functional
substructures, namely, the excitation, the vibration of the string, and the radiation
from the guitar body. It is advantageous to retain this functional partition when
developing a computational model of the acoustic guitar, as suggested by many
studies presented in the literature (Smith, 1993; Karjalainen et al., 1993; Välimäki
et al., 1996; Karjalainen and Smith, 1996; Tolonen and Välimäki, 1997; Välimäki
and Tolonen, 1997a). In the previous section a detailed model for the vibration
of a single string was described. In order to obtain a high-quality simulation of
the acoustic guitar, the excitation and the body models have to be incorporated in
the instrument model. The model should be sufficiently general to accommodate
such effects as those produced by the two vibration polarizations and sympathetic
couplings between the strings.
In the virtual instrument, the excitation model determines the amplitude of
the sound, the plucking type, and the effect of the plucking point, while the body
model gives an identity to the instrument, i.e., it determines what type of guitar is
being modeled. The body model includes the body resonances of the instrument and
determines the directional properties of the radiation. The directional properties are
not included in the model presented here, but they can be added by post-processing
the synthesized signal (Huopaniemi et al., 1994; Karjalainen et al., 1995).
In this section, the principle of commuted waveguide synthesis (CWS) (Smith,
1993; Karjalainen et al., 1993) is first discussed for efficient realization of the
excitation and body models. Second, a physical model that includes the aforementioned
features is presented. In this context the synthetic acoustic guitar is only discussed
generally, without going into details of the DSP structures.
5.4.1 Commuted Model of Excitation and Body
The body of the acoustic guitar is a complex vibrating structure. Karjalainen et
al. (1991) have reported that in order to fully model the response of the body,
a digital all-pole filter of order 400 or more is required. However, this kind of
implementation is impractical, since it would be computationally far too expensive for
real-time applications. Commuted waveguide synthesis (Smith, 1993; Karjalainen et
al., 1993) can be applied to include the body response in the synthetic guitar signal
in a computationally efficient manner. It is based on the theory of linear systems,
and particularly, on the principle of commutation.

[Figure 5.9: The principle of commuted waveguide synthesis. On the top, the
instrument model is presented as three linear filters. In the middle, the body model
B(z) is commuted with the excitation and string models E(z) and S(z). On the
bottom, the body and excitation models are convolved into a single response xexc(n)
that is used to excite the virtual guitar.]
In CWS the instrument model is interpreted as pictured on the top of Figure 5.9,
i.e., as the excitation, the vibrating string, and the radiating body. These parts are
presented as linear filters with transfer functions E(z), S(z), and B(z), respectively.
Since the system is excited with an impulse δ(n), the cascaded configuration implies
that the output signal y(n) is obtained as a convolution of the impulse responses
e(n), s(n), and b(n) of the three filters and the unit impulse δ(n), i.e.,

y(n) = δ(n) * e(n) * s(n) * b(n) = e(n) * s(n) * b(n)       (5.24)

where * denotes the convolution operator defined as

h1(n) * h2(n) = Σ_{k=-∞}^{∞} h1(k) h2(n - k).               (5.25)

In the z-transform domain, Equation 5.24 is expressed as

Y(z) = E(z) S(z) B(z).                                      (5.26)
Since we approximate the behavior of the instrument parts with linear filters, we
can apply the principle of commutation and rewrite Equation 5.26 as

Y(z) = B(z) E(z) S(z)                                       (5.27)
as illustrated in the middle part of Figure 5.9. In practice, it is useful to convolve
the impulse responses b(n) and e(n) of the body and excitation models into a single
impulse response, denoted by xexc(n) on the bottom of Figure 5.9. This signal is used
to excite the string model, and it can be precomputed and stored in a wavetable.
Typically, several excitation signals are used for one instrument. The excitation
signal is varied depending on the string and fret position as well as on the playing
style. Computation of the excitation signal from a recorded guitar tone is presented
in Section 5.4.3.

[Figure 5.10: An extended string model with dual-polarization vibration and
sympathetic coupling (Karjalainen et al., 1998). The excitation, read from a
pluck-and-body wavetable, is filtered by He(z) and P(z) and divided between the
horizontal and vertical string models Sh(z) and Sv(z) by the gain mp; their outputs
are mixed by mo, and sympathetic couplings to and from the other strings pass
through the connection matrix C.]
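The commutation argument of Equations 5.24 to 5.27 can be checked numerically: since discrete convolution is commutative and associative, filtering in the order e * s * b, in the commuted order b * e * s, or with the precomputed excitation (b * e) * s gives the same output. The impulse responses below are arbitrary random sequences standing in for e(n), s(n), and b(n).

```python
import numpy as np

rng = np.random.default_rng(0)
e = rng.standard_normal(16)    # stands in for the excitation response e(n)
s = rng.standard_normal(64)    # stands in for the string response s(n)
b = rng.standard_normal(128)   # stands in for the body response b(n)

# Top of Figure 5.9: y(n) = e(n) * s(n) * b(n)
y_direct = np.convolve(np.convolve(e, s), b)

# Middle: the body model commuted to the front, y(n) = b(n) * e(n) * s(n)
y_commuted = np.convolve(np.convolve(b, e), s)

# Bottom: precompute x_exc(n) = b(n) * e(n), then excite the string model
x_exc = np.convolve(b, e)
y_cws = np.convolve(x_exc, s)
```

All three outputs agree to machine precision, which is exactly what allows the expensive body response to be folded into a stored excitation wavetable.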
There are several other ways to incorporate a model of the instrument body in
a realizable form. These are discussed by Karjalainen and Smith (1996), and they
include methods of reducing the filter order using a conformal mapping to warp the
frequency axis into a domain that better approximates the human auditory system,
and of extracting the most prominent modes in the body response. These modes
are reproduced in the synthetic signal by computationally cheap filters (Karjalainen
and Smith, 1996; Tolonen, 1998).
5.4.2 General Plucked String Instrument Model
The model illustrated in Figure 5.10 exemplifies a general model for the acoustic
guitar string employing the principle of commuted synthesis. A library of different
pluck types and instrument bodies is stored in a wavetable on the left. The excitation
signal is modified with a pluck shaping filter He(z), which provides brightness
and gain control, and a pluck position equalizer P(z), which simulates the effect of the
excitation position on the synthetic signal. The pluck position equalizer corresponds
to the transfer function component presented on top of Figure 5.8.
After the excitation signal is fetched from a wavetable and filtered by the transfer
functions He(z) and P(z), it is fed to the two string models Sh(z) and Sv(z) in a
ratio determined by the gain parameter mp. The string models simulate the effect
of the two polarizations of the transversal vibratory motion, and they are typically
slightly mistuned in delay line lengths and decay rates to produce a natural-sounding
synthetic tone. The output signal is a sum of the outputs of the two polarization
models, mixed in a ratio determined by mo.
[Figure 5.11: An example of the effect of mistuning the polarization models Sh(z)
and Sv(z). Top: equal parameter values; middle: mistuned decay rates; bottom:
mistuned fundamental frequencies.]

In the instrument model of Figure 5.10, sympathetic couplings between the strings
are implemented by feeding the output of the horizontal polarization to a connection
matrix C, which consists of the coupling coefficients. The matrix is expressed as

      [ gc1   c12   c13   ...   c1N ]
      [ c21   gc2   c23         ... ]
C  =  [ c31   c32   gc3             ]                       (5.28)
      [  .                 ...      ]
      [ cN1   ...               gcN ]
where N is the number of dual-polarization strings, the coefficients gck (for k =
1, 2, ..., N) denote the gains of the output signal to be sent from the kth horizontal
string to its parallel vertical string, and the coefficients cmk are the gains of the kth
horizontal string output to be sent to the mth vertical string. Notice that the gain
terms gck implement a coupling between the two polarizations in the kth string and
that the coefficient gc is also presented explicitly in the figure. With this kind
of structure, it is possible to obtain a simulation of both sympathetic coupling
between strings and coupling of the two polarizations within a string. The structure
is inherently stable, since there are no feedback paths in the model. Notice also that
with the parameters mp and mo it is possible to change the configuration of the virtual
instrument. For instance, by setting mp = 1, the vertical polarization will act as a
resonance string with the only input obtained from the horizontal polarization.
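The routing through the connection matrix of Eq. 5.28 can be sketched in a few lines. The gains below are uniform illustrative values, not measured ones; in a real instrument model each entry would be tuned individually.

```python
import numpy as np

N = 6         # number of dual-polarization strings
g_c = 0.1     # gain between the two polarizations of the same string
c = 0.01      # sympathetic coupling gain between different strings

# Connection matrix C of Eq. 5.28: diagonal entries couple the two
# polarizations of one string, off-diagonal entries couple the strings.
C = np.full((N, N), c)
np.fill_diagonal(C, g_c)

h = np.zeros(N)
h[5] = 1.0      # only the 6th string (index 5) produces output
v_in = C @ h    # drive signals fed to the N vertical polarizations
```

Plucking one string thus drives its own vertical polarization with gain g_c and the other strings with the smaller gain c; since the signal flow is feedforward, the structure stays stable regardless of the gain values.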
An example of the effect of mistuning the two polarization models is shown
in Figure 5.11. On the top of the figure, the model parameters are equal and an
exponential decay results. In the middle, the fundamental frequencies of the
models are equal but the loop filter parameters differ from each other, and a two-stage
decay is produced. On the bottom, the loop filter parameters are equal
while the fundamental frequencies are mistuned to obtain a beating effect. Another example is
pictured in Figure 5.12, illustrating the sympathetic couplings between the strings.

[Figure 5.12: An example of sympathetic coupling. The output of a tone E2 played
on the 6th string of the virtual guitar is plotted on the top. In the middle, the sum
of the outputs of the other virtual strings vibrating due to the sympathetic coupling
is illustrated. The output of the virtual instrument, i.e., the sum of all the string
outputs, is presented on the bottom.]
Tone E2 is played on the 6th string of the virtual instrument, and the vibration of
the string is soon damped by the player. On the top part the output of the plucked
string is depicted. In the middle, the summed output of the other strings vibrating
sympathetically is illustrated. On the bottom the output of the virtual instrument is
plotted. Notice that the other strings continue to vibrate after the primary vibration
is damped.
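The beating effect produced by mistuned fundamental frequencies (bottom of Figure 5.11) can be illustrated with two decaying sinusoids standing in for the outputs of Sh(z) and Sv(z); all signal parameters below are invented for the demonstration.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs          # one second of samples
f1, f2 = 200.0, 201.0           # 1 Hz mistuning -> beating at 1 Hz
env = np.exp(-2.0 * t)          # common decay envelope

# Sum of the two polarization outputs; trigonometrically this equals
# 2*env*cos(pi*(f2 - f1)*t)*sin(pi*(f1 + f2)*t), so the amplitude
# envelope dips to zero near t = 0.5 s.
y = env * np.sin(2 * np.pi * f1 * t) + env * np.sin(2 * np.pi * f2 * t)
```

With a small detuning the sum waxes and wanes at the difference frequency, which is the natural-sounding amplitude modulation exploited in the dual-polarization model.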
5.4.3 Analysis of the Model Parameters
After the instrument model has been constructed, both to closely simulate the physical
behavior of a real instrument and to be efficiently realizable in real time, the
model parameters have to be derived. It is natural to start by recording tones of a
real acoustic guitar. Since the recordings are treated as acoustic measurements of a
sound production system, they have to be performed carefully. The side effects of
the environment, such as noise and the response of the room, should be minimized.
An analysis scheme is proposed by Tolonen (1998). In this approach sinusoidal
modeling is used to obtain the decaying partials of the guitar tone as separate
additive signal components. It is shown that the sinusoidal modeling approach is
well suited for this kind of parameter estimation problem.
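One ingredient of such an analysis can be sketched as follows: given the amplitude envelope of a single decaying partial (the kind of data sinusoidal modeling provides), the exponential decay rate can be estimated by fitting a line to the log-amplitude. The synthetic envelope below is an assumption for illustration; the actual scheme of Tolonen (1998) is more elaborate.

```python
import numpy as np

frame_rate = 100.0   # envelope frames per second (assumed analysis rate)
tau = 0.8            # true decay time constant in seconds
n = np.arange(200)
amp = 0.7 * np.exp(-n / (tau * frame_rate))   # synthetic partial envelope

# Fit a line to the log-amplitude; the slope gives the decay rate,
# from which a loop-filter gain for the string model could be derived.
slope, intercept = np.polyfit(n, np.log(amp), 1)
tau_est = -1.0 / (slope * frame_rate)
```

On clean data the fit recovers the time constant exactly; on measured envelopes the same fit gives a robust least-squares estimate.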
6. Evaluation Scheme
The sound synthesis methods presented in this document have been developed for
different types of synthesis problems. Thus it is not appropriate to compare these
methods with each other, since the evaluation criteria, no matter how carefully chosen,
would favor some of the methods. The purpose of the evaluation is rather to give some
guidelines on which methods are best suited for a given sound synthesis problem.
The methods presented in this document were divided into four groups, based
on a taxonomy presented by Smith (1991), to better compare techniques that are
closely related to each other. The groups are: abstract algorithms, sampling and
processed recordings, spectral modeling synthesis, and physical modeling. This division is based on the premises of each sound synthesis method. Abstract algorithms
create interesting sounds with methods that have little to do with sound production
mechanisms in the real physical world. Sampling and processed recordings synthesis take existing sound events and either reproduce them directly or process them
further to create new sounds. Spectral modeling synthesis uses information about the
properties of the sound as it is perceived by the listener. Physical modeling attempts
to simulate the sound production mechanism of a real instrument.
This taxonomy can also be interpreted as being based on tasks generated by the
user of the synthesis system. For evaluation purposes it is helpful to identify these
sound synthesis problems. The tasks for which methods are best suited are:
1. Abstract algorithms
   - creation of new arbitrary sounds
   - computationally efficient moderate-quality synthesis of existing musical
     instruments

2. Sampling and processed recordings synthesis
   - reproduction of recorded sounds
   - merging and morphing of recorded sounds
   - using short sound bursts or recordings to produce new sound events
   - applications demanding high sound quality

3. Spectral models
   - simulation and analysis of existing sounds
   - copy synthesis (audio coding)
   - study of sound phenomena
   - pitch-shifting and time-scale modification

4. Physical models
   - simulation and analysis of physical instruments
   - copy synthesis
   - study of the instrument physics
   - creation of physically unrealizable instruments of existing instrument
     families
   - applications requiring high-fidelity control
Typically a sound synthesis method can be divided into analysis and synthesis
procedures. These are evaluated separately, as they usually have different
requirements. In many cases the analysis can be done off-line, and accuracy can be
gained at the cost of computation time. The synthesis part typically has to run
in real time, and flexible ways to control the synthesis process have to be available.
An excellent discussion on the evaluation of sound synthesis methods is given
by Jaffe (1995). Ten criteria proposed by Jaffe are discussed in the next section with
some additions. These criteria are used to create the evaluation scheme presented in
the last three sections of this chapter. In the next chapter the
evaluation scheme is applied to the synthesis methods presented in this document.
The results are collected and tabulated to ease the comparison of the methods.
The ten criteria address the usability of the parameters; the quality, diversity,
and physicality of the sounds produced; and implementation issues. One more criterion
is included in the evaluation scheme of this document: the suitability of the
synthesis method for parallel implementation. These criteria are rated poor,
fair, or good for each synthesis method.
6.1 Usability of the Parameters
Four aspects of the parameters are discussed: intuitivity, physicality, and
behavior, as well as the perceptibility of parameter changes. Ratings
used to judge the parameters are presented in Table 6.1.
By intuitivity it is meant that a control parameter maps to a musical attribute
or quality of timbre in an intuitive manner. With intuitive parameters the user is
easily able to learn how to control the synthetic instrument. A significant parameter
change should be perceivable for the parameter to be meaningful. Such parameters
are called strong, in contrast to weak parameters, which cause barely audible changes
(Jaffe, 1995). The trend is that the more parameters a synthesis system has, the
weaker they are (Jaffe, 1995). However, overly strong parameters are hard to control,
as a small change in the parameter value has a drastic effect on the produced sound,
no matter how intuitive the parameter is.
Physical parameters provide the player of a synthetic instrument with the behavior
of a real-world instrument. They correspond to quantities the player of a real
instrument is familiar with, such as string length, bow or hammer velocity, or mouth
pressure in a wind instrument. The behavior of a parameter is closely related
to the linearity of the parameters: a change in a parameter should produce a
proportional change in the sound produced.
The criteria presented in this section are tabulated in Table 6.1 with the ratings
that are used in the evaluation.

Table 6.1: Criteria for the parameters of synthesis methods with ratings used in
the evaluation scheme.

                   poor   fair   good
  Intuitivity
  Perceptibility
  Physicality
  Behavior
6.2 Quality and Diversity of Produced Sounds
In this section, criteria for the properties of the produced sound are discussed.
These include the robustness of the sound's identity, the generality of the method,
and the availability of analysis methods. Ratings used to evaluate these criteria are
presented in Table 6.2.
The robustness of the sound is determined by how well the identity of the sound
is retained when the parameters are modified. This is to say
that, e.g., a model of a clarinet should sound like a clarinet when played with
different dynamics and playing styles, or even if the player decides to experiment
with the parameter values. A general sound synthesis method is capable of producing
arbitrary sound events of high quality. Every existing sound synthesis method has
its shortcomings in generality and, indeed, not every method even attempts to
work with arbitrary sounds. This criterion is still useful, as one would hope to
have methods that are as general as possible for several synthesis problems.
For many sound synthesis methods, an analysis method exists to derive synthesis
parameters from recordings of sound signals. This makes the synthesis
method easier to use, as it provides default parameters that can then be modified to play
the synthetic instrument. The analysis part is essential for many of the methods to
be useful at all. In theory, copy synthesis or otherwise optimal parameters can be
derived for most synthesis methods using different kinds of optimization methods.
This is not typically desired; instead, the analysis part often uses knowledge
of the synthesis system to obtain reliable parameters.
In many cases the analysis can be done off-line, and typically it only has to
be performed once for each instrument to be modeled. Thus, accuracy can be given
much more weight at the cost of computing time, and in this context computing
efficiency is discarded as a criterion for analysis methods. In this document,
the analysis procedures of each synthesis system are judged according to their
accuracy, generality, and demands for special devices or instruments.
Table 6.2: Criteria for the quality and diversity of synthesis methods with ratings
used in the evaluation scheme.

                           poor   fair   good
  Robustness of identity
  Generality
  Analysis methods
6.3 Implementation Issues
The implementation of a synthesis method has several important criteria to meet.
The efficiency of the technique is judged, and the latency and the control rate are
estimated. Suitability for parallel implementation is also addressed. Ratings used to
evaluate the criteria concerning implementation issues are presented in Table 6.3.
Efficiency is further divided into three parts: computational demands, memory
usage, and the load caused by control of the method. In many cases, the memory
requirements can be compensated by increasing computational cost (Jaffe, 1995).
Computational cost is rated good if one or several instances of the method can easily
run in real time on an inexpensive processor, fair if only one instance of the method
can run in real time on a modern desktop computer such as a PC or a workstation, and
poor if a real-time implementation is not possible without dedicated hardware or a
powerful supercomputer.
The control stream of the method affects both the expressivity and the
computational demands. Typically, more control is possible with dense control streams
than with sparse control streams (Jaffe, 1995). Processing of the control stream can
also be costly, as it can involve I/O with external devices or files. In
this context the control stream is judged by examining the amount of control made
possible by the density of the stream.
In real-time synthesis systems there will always be some latency, as the system
needs to be causal to be realizable. Latency is a problem especially with methods
that employ block calculations, such as the DFT. Also, with other computationally costly
synthesis methods it is sometimes advantageous to run tight loops for tens or maybe
hundreds of output samples to speed up the calculation, which can cause
latency problems as well. In the ratings, poor means that the system has a
latency of tens or hundreds of milliseconds or more, fair means that the latency is
not perceivable provided there is practically no extra overhead caused by, e.g., the
operating system, and good indicates that the method is tolerant of some varying
overhead.
Table 6.3: Criteria for the implementation issues of synthesis methods with ratings
used in the evaluation scheme.

                        poor   fair   good
  Computational cost
  Memory usage
  Control stream
  Latency
  Parallel processing
Suitability for parallel implementation can be an important factor in certain
situations. In this context it is assumed that fast communication between parallel
processes is available. The system is rated good on suitability for parallel
processing if it can easily be divided into several processes so that communication
between processes happens approximately at the sampling rate level of the system.
The rating fair is given if the system can be divided into two processes communicating
at the sampling rate level or if it is advantageous to distribute the computation at
a higher communication level. The method is judged poor if there is little or no
advantage in parallelizing the processing.
The synthesis methods are rated in Table 7.1 against the criteria presented in
this section.
7. Evaluation of Several Sound
Synthesis Methods
In this chapter the sound synthesis methods presented in this document are evaluated
using the criteria discussed in the previous chapter. Ratings are tabulated
for each method, and they are also collected in Table 7.1 to enable comparison of
the methods. It should be noted that the intention is not to decide which synthesis
method is the best in general, for that would be impossible. Rather, the evaluation
should give some guidelines by which a proper method can be chosen for a given
sound synthesis problem.

For some methods there are criteria that we feel we cannot evaluate; those
criteria are not rated.
7.1 Evaluation of Abstract Algorithms
7.1.1 FM Synthesis
The FM synthesis parameters are strong offenders against the criteria of intuitivity,
physicality, and behavior, as the modulation parameters do not correspond to musical
parameters or parameters of musical instruments at all, and because the method is
highly nonlinear. Thus it is rated poor in all these categories. Notice, however,
that the modulation index parameter I is directly related to the bandwidth of the
produced signal. The method has strong parameters, i.e., parameter changes are
easily audible. The rating in perceptibility is good.
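The relation between the modulation index I and the bandwidth can be demonstrated with a simple FM pair; the carrier and modulator frequencies and the 1 kHz cutoff below are arbitrary choices for the demonstration.

```python
import numpy as np

def fm_tone(fc, fm, I, dur=0.5, fs=8000):
    """Simple FM pair: carrier fc, modulator fm, modulation index I."""
    t = np.arange(int(dur * fs)) / fs
    return np.sin(2 * np.pi * fc * t + I * np.sin(2 * np.pi * fm * t))

def hf_fraction(y, fs=8000, cutoff=1000.0):
    """Fraction of spectral energy above `cutoff`: a crude bandwidth proxy."""
    spec = np.abs(np.fft.rfft(y)) ** 2
    f = np.fft.rfftfreq(len(y), 1.0 / fs)
    return spec[f > cutoff].sum() / spec.sum()

y_narrow = fm_tone(400.0, 100.0, I=0.5)   # sidebands stay near the carrier
y_wide = fm_tone(400.0, 100.0, I=8.0)     # sidebands spread far from it
```

Significant sidebands appear at fc ± k·fm for roughly k up to I + 1, so raising I moves a clearly larger fraction of the energy above the cutoff.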
FM synthesis does not behave well when it is used to mimic a real instrument
with varying dynamics and playing styles. The parameters of the method have to
be changed very carefully in order not to lose the identity of the instrument. The
method is rated poor for robustness of identity. Generality of FM synthesis is good.
Analysis methods for FM have been proposed, but they do not apply well
to general cases; thus the rating for analysis methods is poor. The interested reader is
referred to a work by Delprat (1997) for methods of extracting frequency modulation
laws by signal analysis.
The efficient implementations of FM have made it a popular method. It is very
cheap to implement, uses little memory, and the control stream is sparse. Minimal
latency makes the method attractive for real-time synthesis purposes. The method
is rated good for all these criteria. FM synthesis is computationally so cheap that
distributing one FM instrument is not feasible. Naturally, several FM instruments
can be divided to run on several processors.
7.1.2 Waveshaping Synthesis
Waveshaping parameters are more intuitive (fair) than FM parameters, especially
when Chebyshev polynomials are used as the shaping function: scaling the weighting
gain of a single Chebyshev polynomial only changes the gain of one harmonic. Still,
the parameters are neither very perceptible nor physical (poor). Depending on the
parameterization, the parameters typically behave fairly well.
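The Chebyshev property mentioned above can be demonstrated directly: driving Tk with a unit-amplitude cosine yields exactly the kth harmonic, since Tk(cos θ) = cos(kθ). The input frequency below is an arbitrary choice.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
x = np.cos(2 * np.pi * 100.0 * t)   # unit-amplitude 100 Hz input cosine

# T3 as the shaping function: T3(cos w) = cos(3w), i.e. the 3rd harmonic
T3 = np.polynomial.chebyshev.Chebyshev([0, 0, 0, 1])
y = T3(x)
```

The waveshaped output is a pure 300 Hz cosine; a weighted sum of several Tk shaping functions therefore places each harmonic at an independently controlled gain.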
Waveshaping is fairly general in that arbitrary harmonic spectra are easy to produce.
By adding amplitude modulation after the waveshaping, inharmonic
spectra can be produced. Noisy signals cannot be generated easily. Spectral analysis can
easily be applied to obtain the amplitude of each harmonic. These data can
be used directly as the gains of the Chebyshev polynomials. The rating for analysis
methods is thus good.
Just like FM synthesis, waveshaping can be implemented very efficiently, and
distribution of one instance is not feasible. The method is rated good for computing,
memory, and control stream efficiency as well as for latency.
7.1.3 Karplus-Strong Synthesis
The few parameters of the Karplus-Strong synthesis are very intuitive, their changes
are easily audible, and they are well behaved. Thus the rating for all these criteria is
good. In the basic form, the method only has a parameter for the pitch and one
for determining the type of tone, e.g., string or percussion. The physicality is thus
rated fair.
KS synthesis is robust in that it will sound like a plucked string or a drum even
when the parameters are changed; it is thus rated good for robustness of identity. In
generality the method rates poor. Analysis techniques for KS synthesis itself are not
available, but they exist for related sound synthesis methods (see Section 5.4).
Just like the other abstract algorithms, KS synthesis is very attractive to implement in
real time. The ratings for implementation issues are the same as with FM synthesis
and waveshaping.
7.2 Evaluation of Sampling and Processed Recordings
7.2.1 Sampling
In sampling synthesis a recording of a sound signal is played back with possible
looping in the steady-state part. Sampling is controlled just by note on/off and gain
parameters. We have decided not to give ratings for these trivial parameters in order
not to disturb the evaluation of the other synthesis methods.
Sampling is very general (good) in that any sound can be recorded and sampled.
The identity of the sound is retained with different playing styles and conditions,
but at the cost of naturalness. Robustness of identity is rated fair. Analysis methods
for determining the looping breakpoints are available and usually give good results
with harmonic sounds.
Sampling is computationally very efficient (good), but it uses a lot of memory (poor).
The control stream is sparse (good) and the latency is small (good). Distribution of one
sampling instrument is not feasible unless, e.g., a server is utilized as memory
storage. The rating is fair.
7.2.2 Multiple Wavetable Synthesis
Multiple wavetable synthesis methods can be parameterized in various ways, and the
result of synthesis is highly dependent on the signals stored in the wavetables. Thus we
decided not to give ratings on the parameters of the method or the robustness of the
sound's identity.
The method is general (good) and analysis methods for some implementations
are available (fair).
The method is fairly easy to implement computationally, but it uses a lot of
memory (poor). The control stream is not very costly computationally (good), and latency
times can be kept small (good). Just as with sampling, a separate wavetable server
can reduce the memory requirements of a single instance of multiple wavetable
synthesis, provided that fast connections are available. Suitability for distributed
parallel processing is rated fair.
7.2.3 Granular Synthesis
Granular synthesis is a set of techniques that vary quite a lot from each other in
parameterization and implementation. Here a general evaluation of the concept is
attempted. In the most primitive form the parameters of granular synthesis control
the grains directly. The number of grains is typically very large and more elaborate
means to control them must be utilized. The parameters are thus rated poor in
intuitivity, perceptibility, and physicality. The system is linear and the behavior of
the parameters is good.
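The direct, low-level control of grains can be illustrated with a minimal asynchronous granular synthesis sketch: each grain is a short windowed excerpt of a source signal placed at a random onset time. The grain length, density, and source below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
fs = 8000
src = np.sin(2 * np.pi * 220.0 * np.arange(fs) / fs)  # 1 s source signal

grain_len = 256
window = np.hanning(grain_len)   # smooth grain envelope
out = np.zeros(fs)

for _ in range(400):             # grain density: 400 grains per second
    pos = rng.integers(0, len(src) - grain_len)    # read position in source
    onset = rng.integers(0, len(out) - grain_len)  # placement in output
    out[onset:onset + grain_len] += window * src[pos:pos + grain_len]
```

Even this toy version makes the control problem apparent: every grain needs a position, an onset, and an envelope, so higher-level parameters (density, grain size distributions, etc.) are needed in practice.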
Analysis methods for pitch synchronous granular synthesis (PSGS) exist, and they
are also efficient (good). As the asynchronous method does not attempt to model or
reproduce recorded sound signals, no analysis tools are necessary. Granular synthesis
methods are general (good), and with PSGS the robustness of identity is retained
well (good).
The implementation of the method is fairly ecient, and also the memory requirements are (fair) as the grains are short and it is typically assumed that the
signals can be composed of few basic grains. The low-level control stream can become very dense especially with AGS (poor). The method does not pose latency
problems (good) and the suitability for parallel processing is rated fair.
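The density of the low-level control stream can be made concrete with a minimal sketch of asynchronous granular synthesis in Python. All names and parameter values below are illustrative, not taken from any particular implementation: each grain is a short Hann-windowed sinusoid, and even this half-second example already scatters a hundred individually controlled grains.

```python
import math
import random

def grain(freq, dur, fs):
    """One Hann-windowed sine grain."""
    n = int(dur * fs)
    return [math.sin(2 * math.pi * freq * i / fs)
            * 0.5 * (1 - math.cos(2 * math.pi * i / (n - 1)))
            for i in range(n)]

def async_granular(total_dur=0.5, density=200, fs=8000, seed=1):
    """Scatter grains with random onsets, frequencies, and lengths."""
    random.seed(seed)
    out = [0.0] * int(total_dur * fs)
    for _ in range(int(density * total_dur)):
        g = grain(random.uniform(200, 2000), random.uniform(0.01, 0.04), fs)
        onset = random.randrange(len(out) - len(g))
        for i, s in enumerate(g):
            out[onset + i] += s
    return out

cloud = async_granular()
```

Every grain here carries its own onset, frequency, and duration, which is exactly why direct grain-level control is rated poor in intuitivity above.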
7.3 Evaluation of Spectral Models
7.3.1 Basic Additive Synthesis
The parameters of basic additive synthesis directly control the sinusoidal oscillators. Ways to reduce the control data are available, and some of them are discussed with the other spectral modeling methods. In this context only basic additive synthesis is discussed. The parameters are fairly intuitive in that frequencies and amplitudes are easy to comprehend. The behavior of the parameters is good as the method is linear. Perceptibility and physicality of the parameters are poor.
Additive synthesis can in theory synthesize arbitrary sounds if an unlimited number of oscillators is available. This soon becomes impractical, as noisy signals cannot be modeled efficiently, and thus the generality is rated fair. Analysis methods (good) based on, e.g., the STFT are readily available, as additive synthesis serves as the synthesis part of some of the other spectral modeling methods. Robustness of identity is not evaluated, as the control of a synthetic instrument would require more elaborate control techniques.
A single sinusoidal oscillator can be implemented efficiently, but additive synthesis typically requires a large number of them. Computational cost is rated fair. The control data requires a large amount of memory (poor), and the control stream is very dense (poor). Latency time is small as the oscillators run in parallel (good). Parallel implementation can become feasible when the number of oscillators grows large. In a distributed implementation the oscillators and corresponding controls have to be grouped. The suitability for distributed parallel processing is rated fair.
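The oscillator bank itself is simple; the burden lies in the per-partial controls. The Python sketch below is illustrative only, with an arbitrary linear decay standing in for the stored amplitude envelopes that a real instrument would need for every partial.

```python
import math

def additive(partials, dur=0.25, fs=8000):
    """Sum a bank of sinusoidal oscillators.

    Each partial is a (frequency_hz, amplitude) pair; a shared linear
    decay is a crude stand-in for per-partial amplitude envelopes."""
    n = int(dur * fs)
    out = [0.0] * n
    for f, a in partials:
        for i in range(n):
            env = 1.0 - i / n            # illustrative amplitude envelope
            out[i] += a * env * math.sin(2 * math.pi * f * i / fs)
    return out

# a quasi-harmonic tone with four partials
tone = additive([(220, 1.0), (440, 0.5), (660, 0.33), (880, 0.25)])
```

With hundreds of partials and genuinely time-varying envelopes, the control data grows to the large memory and dense control stream noted above.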
7.3.2 FFT-based Phase Vocoder
The parameters of the FFT-based phase vocoder are directly related to the STFT analysis: the FFT length, the window length and type, and the hop size. While these are not directly intuitive, they can be comprehended in the case of, e.g., time-scale modifications or pitch shifts. Intuitivity is rated fair. Parameters like the hop size are relatively strong, whereas the window type might not have
any significant effect except in some specific situations. Perceptibility is rated fair. Physicality of the parameters is poor, and the behavior of the parameters is good if the changes are taken into account in the analysis stage as well.
The method retains the identity of the modeled instrument well, especially if time-varying time-scale modification is applied (Serra, 1997a) (good). Generality is good, and it relies heavily on the analysis stage (good).
The implementation of the method can be made relatively efficient (fair) by using the FFT. The memory requirements are fair, and the control stream is dense (poor) as the phase vocoder uses a transform of an original signal. Latency time is large (poor) because of the block-based FFT computation. Suitability for parallel processing is fair, as the synthesis stage mainly consists of an IFFT. It was decided that all FFT-based methods are rated fair on suitability for parallel processing since, although the computation of an FFT can be efficiently parallelized, it is typically computed in a single process.
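The strength of the hop size parameter can be made concrete: STFT resynthesis is free of amplitude modulation only when the overlapped analysis windows sum to a constant (the constant overlap-add condition). The Python sketch below, with illustrative function names, checks this for a periodic Hann window at 50% overlap.

```python
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 * (1 - math.cos(2 * math.pi * i / n)) for i in range(n)]

def ola_gain(window, hop, length):
    """Sum of overlapped analysis windows.  The sum is constant exactly
    when overlap-add resynthesis introduces no amplitude modulation."""
    out = [0.0] * length
    for start in range(0, length - len(window) + 1, hop):
        for i, w in enumerate(window):
            out[start + i] += w
    return out

w = hann(512)
gain = ola_gain(w, hop=256, length=4096)
# away from the edges the overlapped windows sum to exactly 1
steady = gain[512:4096 - 512]
```

Changing the hop away from such a value modulates the output amplitude at the frame rate, which is why the hop size is a perceptually strong parameter while the window type often is not.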
7.3.3 McAulay-Quatieri Algorithm
The McAulay-Quatieri algorithm is based on a sinusoidal representation of signals.
It uses additive synthesis as its synthesis part and can thus be interpreted as an
analysis and data reduction method for the simple additive synthesis.
The control parameters of the MQ algorithm consist of amplitude, frequency, and phase trajectories. These trajectories are interpolated to obtain the additive synthesis parameters. As with the phase vocoder, the intuitivity is rated fair, the perceptibility fair, physicality poor, and behavior good.
The algorithm works with arbitrary signals if the number of sinusoidal oscillators is increased accordingly, but it is infeasible for noisy signals (fair). The analysis method is good, and the sound identity is retained fairly well under modifications.
The implementation is fairly efficient. The control stream (fair) is reduced at the cost of interpolating the trajectories. The trajectories take less memory than the envelopes of additive synthesis, and the memory usage is rated fair. Latency of the synthesis part is better (fair) than in the phase vocoder, as it is related to the hop size instead of the FFT size. Suitability for parallel processing is rated fair, as with additive synthesis.
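The trade between a sparse control stream and interpolation cost can be sketched for a single trajectory. The Python below is illustrative (the function name and frame values are assumptions, not from a specific implementation): frame-rate (frequency, amplitude) pairs are linearly interpolated to sample-rate oscillator controls, with phase accumulation providing the additive synthesis back end.

```python
import math

def mq_resynth(frames, hop, fs):
    """Resynthesize one partial from frame-rate (frequency, amplitude)
    pairs by linear interpolation and phase accumulation."""
    out = []
    phase = 0.0
    for (f0, a0), (f1, a1) in zip(frames, frames[1:]):
        for i in range(hop):
            t = i / hop
            f = f0 + (f1 - f0) * t        # interpolated frequency
            a = a0 + (a1 - a0) * t        # interpolated amplitude
            phase += 2 * math.pi * f / fs
            out.append(a * math.sin(phase))
    return out

# a glide from 200 Hz to 400 Hz while fading out
sig = mq_resynth([(200, 1.0), (300, 0.7), (400, 0.4)], hop=256, fs=8000)
```

Only two numbers per partial per frame travel in the control stream; the per-sample envelopes of basic additive synthesis are recomputed locally.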
7.3.4 Source-Filter Synthesis
The parameters of source-lter modeling include the choice of excitation signal,
fundamental frequency if it exists, and the coecients of the time-varying lter.
These do not seem very intuitive but when the lter is parameterized properly,
formants can be controlled easily. Intuitivity is thus rated fair. Perceptibility also
is rated fair as changes in excitation signal are easily audible. The audibility of
lter parameters depends again on parameterization. When source-lter synthesis
is applied to simulate the human sound production system, the parameters are fairly
physical, as the formants correspond to the shape of the vocal tract. In time-varying filtering, transition effects caused by updating the filter parameters are problematic and can easily become audible as disturbing artifacts. This causes the behavior of the parameters to be rated poor.
Source-filter synthesis is general (good) in that it can, in theory, produce any sound. For example, linear prediction offers an analysis method to obtain the filter coefficients, and inverse filtering can be utilized to obtain an excitation signal (good). Robustness of identity (fair) depends on the parameterization, but it can easily be lost if the filter parameters are not updated carefully.
Excitation and filtering are fairly efficient to implement. The method does not require a great deal of memory (fair). The control stream depends on the modeled signal: for a steady-state signal it is very sparse, but for speech the filter coefficients have to be updated every few milliseconds (fair). Latency time of source-filter synthesis is small (good). Parallel distribution does not seem to offer any great advantage, as a large part of the computational cost comes from filter coefficient updates (poor).
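A minimal time-invariant instance can be sketched in Python (time-invariant for brevity; the 100 Hz impulse train and the formant values are arbitrary choices): an excitation signal drives a standard two-pole resonator that imposes one formant.

```python
import math

def two_pole(signal, center_hz, bandwidth_hz, fs):
    """Two-pole resonator (one formant):
    y[n] = x[n] + b1*y[n-1] + b2*y[n-2],
    with the pole radius set by the desired bandwidth."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    b1 = 2 * r * math.cos(2 * math.pi * center_hz / fs)
    b2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + b1 * y1 + b2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

fs = 8000
# impulse-train excitation at 100 Hz as a crude glottal source
excitation = [1.0 if i % (fs // 100) == 0 else 0.0 for i in range(fs // 4)]
vowel_ish = two_pole(excitation, center_hz=700, bandwidth_hz=100, fs=fs)
```

In a real time-varying implementation the coefficients b1 and b2 would be updated every few milliseconds, and it is exactly those updates that cause the transition artifacts and the computational cost discussed above.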
7.3.5 Spectral Modeling Synthesis
Spectral modeling synthesis uses additive synthesis to produce the deterministic (harmonic) component and source-filter synthesis to produce the stochastic (noisy) component of the synthetic signal. The parameters consist of the amplitude and frequency trajectories of the deterministic component and the spectral envelopes of the stochastic part. These parameters are rated fair in intuitivity. For modification of the analyzed signal to be meaningful, higher-level controls have to be utilized to reduce the control data. Perceptibility and physicality are thus rated poor. The behavior of the parameters is good.
Robustness of identity is rated good, as the composition of the signal provides means to edit the deterministic and stochastic parts separately. This allows for better control of the attack and steady-state parts than with, e.g., the phase vocoder. Spectral modeling synthesis is judged to be good in generality and in its analysis method.
Computational cost is reasonable (fair). The method requires more memory than the MQ algorithm, but it is still rated fair in memory usage. The control stream is fairly sparse, as the additive and source-filter parameters are interpolated between STFT frames. Latency time is on the order of that of the MQ algorithm (fair). Additive and source-filter synthesis can be divided into separate parallel processors, and thus the suitability for distributed parallel processing is fair.
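The deterministic/stochastic decomposition can be sketched as follows in Python. This is only an illustration of the signal model, with a piecewise-constant noise envelope as a deliberately crude stand-in for the filtered-noise residual of spectral modeling synthesis; all names and values are assumptions.

```python
import math
import random

def sms_sketch(partials, noise_env, dur=0.25, fs=8000, seed=5):
    """Deterministic part: a sum of sinusoids.  Stochastic part: white
    noise shaped by a piecewise-constant amplitude envelope."""
    random.seed(seed)
    n = int(dur * fs)
    det = [sum(a * math.sin(2 * math.pi * f * i / fs) for f, a in partials)
           for i in range(n)]
    seg = n // len(noise_env)
    sto = [noise_env[min(i // seg, len(noise_env) - 1)]
           * random.uniform(-1, 1) for i in range(n)]
    return [d + s for d, s in zip(det, sto)]

mix = sms_sketch([(220, 0.8), (440, 0.4)], noise_env=[0.3, 0.1, 0.05, 0.02])
```

Because the two parts are generated separately, either one can be edited or scaled without disturbing the other, which is the basis of the good robustness rating above.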
7.3.6 Transient Modeling Synthesis
Transient modeling synthesis is an extension of spectral modeling synthesis in that it allows for further processing of the residual signal as separate noise and transient signals. It is fair to say that TMS is more general than SMS and that it also involves more computation. The two methods are close enough for the ratings to be identical.
7.3.7 FFT⁻¹
FFT⁻¹ is an additive synthesis method in the frequency domain that is also capable of producing noisy signals. The parameters consist of the frequencies and amplitudes of the partials and of the bandwidths and amplitudes of the noisy components. They are rated fair in intuitivity. The parameters are poorly perceptible, as in a complex signal the number of signal components can be very large. They are not physical (poor), but they are linear and behave well (good).
The method is general (good), as it can produce harmonic, inharmonic, and noisy signals. The STFT provides a good analysis method. As with additive synthesis, the robustness of sound identity is not evaluated for FFT⁻¹.
The implementation of the method is efficient (good). The memory usage is fair, but the control stream can become very dense (poor) when the number of signal components increases. FFT⁻¹ is a block-based method and suffers from latency problems (poor). As the method uses an IFFT in the synthesis stage, it is rated fair for suitability for parallel processing.
7.3.8 Formant Wave-Function Synthesis
The parameters of FOF synthesis govern the fundamental frequency and the structure of the formants of the synthesized sound signals. The parameters can be judged fairly intuitive and physical, especially when the method is used for simulation of the human sound production system. Perceptibility is good, as there are typically only a few formants present in speech or singing voice signals. The parameters are well-behaved (good).
The method is fairly general, as it can produce high-quality harmonic sound signals of singing and musical instruments. Linear prediction provides an analysis method (good), and the sounds produced retain their identity well (good), as demonstrated by sound examples (Bennett and Rodet, 1989).
The method can be implemented efficiently when the different FOFs are stored in wavetables (good). This increases the amount of required memory (fair). The control stream is fairly sparse and the latency time is small (good). Parallel processing of a single FOF instrument does not seem feasible, as the excitation signal is shared by all of the FOF generators (poor).
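The idea of one formant wave function per fundamental period can be sketched in Python. The grain shape below (a sinusoid at the formant frequency with a raised-cosine attack and an exponential decay set by the bandwidth) follows the general FOF construction; the specific attack length, formant, and bandwidth values are illustrative assumptions.

```python
import math

def fof_grain(formant_hz, bw_hz, attack_s, fs, dur_s=0.02):
    """One formant wave function: a damped sinusoid with a smooth attack."""
    n = int(dur_s * fs)
    na = int(attack_s * fs)
    g = []
    for i in range(n):
        env = math.exp(-math.pi * bw_hz * i / fs)       # bandwidth -> decay
        if i < na:                                      # raised-cosine attack
            env *= 0.5 * (1 - math.cos(math.pi * i / na))
        g.append(env * math.sin(2 * math.pi * formant_hz * i / fs))
    return g

def fof_voice(f0, formant_hz, bw_hz, fs=8000, dur=0.25):
    """Overlap-add one FOF grain per fundamental period."""
    out = [0.0] * int(dur * fs)
    g = fof_grain(formant_hz, bw_hz, attack_s=0.003, fs=fs)
    period = int(fs / f0)
    for onset in range(0, len(out) - len(g), period):
        for i, s in enumerate(g):
            out[onset + i] += s
    return out

voice = fof_voice(f0=110, formant_hz=600, bw_hz=80)
```

Every grain instance reuses the same excitation pattern, which illustrates why all FOF generators of one instrument are tied to a shared excitation.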
7.3.9 VOSIM
The parameters of the VOSIM model are not very well related to either the sound production mechanism being modeled or the sound itself. Thus the physicality and intuitivity of the parameters are both rated poor. The perceptibility is rated fair, as some of the parameters are strong and some weak. The behavior is also rated fair because, although the method is linear, the effect of each parameter on the sound produced may not be very well-behaved.
The method is fairly general, as it has been used to model the human sound production mechanism and some musical instruments. An efficient analysis method was not found in the literature (poor). The parameterization suggests that the method is not robust under parameter modification (poor). VOSIM can be implemented efficiently and it only requires a small amount of memory (both rated good). The control stream is sparse (good) and the latency time can be kept small (good). There is little advantage in a parallel implementation of the system (poor).
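The small memory footprint follows from the waveform construction: a VOSIM period is a short burst of sin² pulses with geometrically decreasing amplitudes, followed by a silent gap. The Python sketch below assumes this basic pulse structure; the specific counts and factors are illustrative.

```python
import math

def vosim_period(n_pulses, pulse_samples, decay, gap_samples):
    """One VOSIM period: a burst of sin^2 pulses, each 'decay' times
    the previous one in amplitude, followed by a silent gap."""
    out = []
    amp = 1.0
    for _ in range(n_pulses):
        out += [amp * math.sin(math.pi * i / pulse_samples) ** 2
                for i in range(pulse_samples)]
        amp *= decay
    out += [0.0] * gap_samples
    return out

period = vosim_period(n_pulses=3, pulse_samples=40, decay=0.7, gap_samples=80)
signal = period * 20          # repeat the period to form a tone
```

A handful of scalar parameters fully determines the waveform, so both the memory usage and the control stream stay small, as rated above.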
7.4 Evaluation of Physical Models
7.4.1 Finite Difference Methods
The parameters of finite difference methods correspond directly to the physical parameters of the modeled sound production system. Thus they are very physical and intuitive, and they can also be rated good for perceptibility and behavior, as the vibratory motion is assumed linear.
Although the method can in theory be applied to arbitrary sound production systems, a new implementation is typically required when the instrument under study changes. Thus, FD methods are rated fair in generality. A tuned model of an instrument behaves very much like the original and retains the identity well (good). Analysis methods are available, but although the results can be very good, they often involve specialized measurement instruments and require a great deal of time and effort (fair).
FD methods are computationally very inefficient (poor), and they also need a fair amount of memory. The control stream depends on the excitation, but it is very sparse with plucked or struck strings and with mallet instruments (good). The method does not pose a problem with latency times if sufficient computational capacity is available (good). The method is well suited (good) for distributed parallel processing, as significant improvements can be achieved by dividing the system into several substructures running as different processes.
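The per-sample cost can be seen from the simplest case, the 1-D ideal string at the stability limit (Courant number 1), where the recursion y[m,n+1] = y[m+1,n] + y[m-1,n] - y[m,n-1] must be evaluated at every spatial point for every output sample. The Python below is an illustrative sketch with an arbitrary string length and pluck shape, and a crude at-rest initialization.

```python
def fd_string_step(y_now, y_prev):
    """One time step of the ideal-string scheme at the stability limit:
    y[m,n+1] = y[m+1,n] + y[m-1,n] - y[m,n-1], with fixed ends."""
    m = len(y_now)
    y_next = [0.0] * m
    for i in range(1, m - 1):
        y_next[i] = y_now[i + 1] + y_now[i - 1] - y_prev[i]
    return y_next

# pluck: triangular initial displacement, string initially at rest
M = 50
shape = [i / 10 if i <= 10 else (M - 1 - i) / (M - 11) for i in range(M)]
prev, now = shape[:], shape[:]       # equal states approximate zero velocity
output = []
for _ in range(400):
    prev, now = now, fd_string_step(now, prev)
    output.append(now[M // 2])       # "pickup" at the middle of the string
```

Even this toy string performs M operations per sample; realistic 2-D and 3-D systems multiply that cost enormously, which is the source of the poor efficiency rating.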
7.4.2 Modal Synthesis
Modal synthesis parameters consist of the modal data of the modeled structure and the excitation or driving data. The parameters are not very intuitive (fair), and a change in a single mode can be hard to perceive if the number of modes is large (poor). However, they correspond directly to physical structures, and the physicality and the behavior are rated good.
Analysis methods for the system are available and they produce reliable and accurate results. However, they suffer from being very complicated and expensive (fair). The system is general, as any vibrating object can be formulated in terms of its modal data. Arbitrary sounds unrelated to physical objects, however, are not easily produced by the mechanism. Generality is rated fair. The modeled structure retains its identity
very well (good) as it is typically controlled by the excitation signal.
The method is implemented as a set of parallel second-order resonators that are computationally efficient. The number of resonators can grow very large, and thus the computational efficiency is rated fair. The modal data requires a large amount of memory (poor). The excitation signal defines the control stream for static structures, and it can be rated sparse (good). The substructures can be efficiently distributed and processed in parallel (good).
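The resonator-bank view can be illustrated directly, since the impulse response of each second-order resonator is an exponentially decaying sinusoid. The Python sketch below sums such modes in closed form; the three mode frequencies, decay rates, and amplitudes are purely illustrative.

```python
import math

def modal_tone(modes, dur=0.5, fs=8000):
    """Sum of exponentially decaying sinusoids, one per mode.
    Each mode is (frequency_hz, decay_rate_1_per_s, amplitude)."""
    n = int(dur * fs)
    out = [0.0] * n
    for f, d, a in modes:
        for i in range(n):
            t = i / fs
            out[i] += a * math.exp(-d * t) * math.sin(2 * math.pi * f * t)
    return out

# three modes of a hypothetical struck bar
bar = modal_tone([(440, 6.0, 1.0), (1209, 9.0, 0.6), (2370, 14.0, 0.3)])
```

Each mode is independent of the others, which is exactly why the modal data distributes so naturally over parallel processors.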
7.4.3 CORDIS
A clear description of the CORDIS system parameters was not found in the literature. For this reason the parameters of CORDIS are not evaluated.
As the method uses a physics-based description of the vibrating system, it retains the identity of the sound well (good) with different meaningful excitation signals. No analysis system was described in the literature (poor). The generality of the system is fair, as it can, in theory, model arbitrary vibrating objects. The method does not provide an easy way to create arbitrary sounds.
The method is judged to be computationally fair: although the basic elements can be computed efficiently, a large number of them is needed. This also accounts for the fair rating for memory requirements. The control rate depends on the parameterization and is not evaluated here. The latency time of CORDIS is small (good). It appears that CORDIS may be well suited for parallel processing (good) (Rocchesso, 1998).
7.4.4 Digital Waveguide Synthesis
Digital waveguide synthesis parameters are intuitive and physical as they correspond
well to physical structures of the instrument and the way it is being played (both
rated good). The parameter changes are typically audible (good) and they behave
well with linear models. As some of the waveguide models have nonlinear couplings,
the behavior is rated fair.
The identity of the instrument is retained very well (good). The method can be used to simulate instruments with one-dimensional vibrating objects, such as string and wind instruments, and it is thus rated fair in generality. Automated analysis methods for linear models are efficient, but they are not available for nonlinear models (fair).
A digital waveguide can be implemented very efficiently, but typically the model also incorporates other structures that increase the computational requirements. Computational efficiency is rated fair. Digital waveguide models require little memory other than the excitation signal: a high-quality plucked string tone of several seconds can be produced with only a few thousand words of memory (good). The control stream depends on the instrument being modeled and is here rated fair. The method does not pose latency problems, and especially models with several vibrating
structures can be efficiently divided into substructures that are computed in parallel (both rated good).
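The efficiency and the small memory footprint are both visible in the simplest waveguide-family model, the Karplus-Strong plucked string: a delay line whose length sets the pitch, with a two-point averaging loss filter in the feedback loop. The Python below is a minimal sketch with illustrative parameter values.

```python
import random

def karplus_strong(freq, dur, fs=8000, seed=3):
    """Plucked string: a ring-buffer delay line with a two-point
    averaging (loss) filter in the feedback loop."""
    random.seed(seed)
    n = int(fs / freq)                   # delay-line length sets the pitch
    line = [random.uniform(-1, 1) for _ in range(n)]   # noise burst = pluck
    out = []
    for i in range(int(dur * fs)):
        s = line[i % n]
        line[i % n] = 0.5 * (s + line[(i + 1) % n])    # loop filter
        out.append(s)
    return out

ks_tone = karplus_strong(220, dur=0.5)
```

The entire state is the delay line of a few dozen samples, consistent with the few-thousand-word memory figure given above; full waveguide models add fractional-delay, loss, and body filters on top of this loop.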
7.4.5 Waveguide Meshes
The parameters of the waveguide meshes are fairly intuitive as they correspond to
the excitation and the properties of the 2D or 3D vibrating system. Parameters are
physical and perceptible and they behave well as the mesh is linear (rated good for
all those criteria).
Analysis methods were not found in the literature (poor). The method is fairly general, as it can be applied to the simulation of 2D and 3D objects. The robustness of identity is not evaluated.
The method is computationally expensive and requires a large amount of memory (both rated poor). The control stream is fairly sparse, as it consists only of the excitation information. The method itself does not pose latency problems (good), although real-time implementations of more complex structures cannot be achieved without expensive supercomputers. One of the main advantages of the model is that it can be divided into arbitrary substructures that are computed in parallel (good).
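The cost of the mesh is easy to see in a sketch of the 2-D rectilinear case, where every scattering junction is updated from its four neighbors at every time step: v[x,y,n+1] = (sum of the four neighbors at time n)/2 - v[x,y,n-1]. The Python below uses an illustrative 16-by-16 membrane with clamped edges.

```python
def mesh_step(v_now, v_prev):
    """One step of the 2-D rectilinear waveguide mesh:
    v[x,y,n+1] = (neighbors' sum at n)/2 - v[x,y,n-1], clamped edges."""
    X, Y = len(v_now), len(v_now[0])
    v_next = [[0.0] * Y for _ in range(X)]
    for x in range(1, X - 1):
        for y in range(1, Y - 1):
            s = (v_now[x - 1][y] + v_now[x + 1][y]
                 + v_now[x][y - 1] + v_now[x][y + 1])
            v_next[x][y] = 0.5 * s - v_prev[x][y]
    return v_next

# excite a small membrane at its centre and listen near a corner
X = Y = 16
prev = [[0.0] * Y for _ in range(X)]
now = [[0.0] * Y for _ in range(X)]
now[X // 2][Y // 2] = 1.0
pickup = []
for _ in range(200):
    prev, now = now, mesh_step(now, prev)
    pickup.append(now[2][2])
```

The work grows with the number of junctions, i.e. quadratically with the resolution of a 2D mesh, but the update of each region depends only on its immediate neighbors, which is why the mesh partitions so naturally into parallel substructures.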
7.4.6 Commuted Waveguide Synthesis
Commuted digital waveguides have been used to produce high-quality synthesis
of instruments that can be described as having linear or linearizable coupling of
excitation to the vibrating structure. The parameters are very intuitive, perceptible,
physical and well-behaved, and commuted waveguide synthesis is rated good for those
criteria.
The method is very good at retaining the identity of the modeled instrument. For good synthesis results, the parameters need to be derived by analysis of existing instruments. The analysis methods employ the STFT and produce good results. The method is fairly general, as a number of percussive, plucked, or struck string instruments can be modeled with commuted synthesis.
The implementation issues of commuted waveguide synthesis are very close to those of digital waveguide synthesis. The ratings are the same and are repeated here for convenience: computational efficiency and control stream are rated fair, and memory usage, latency, and suitability for parallel processing good.
7.5 Results of Evaluation
The evaluation results discussed in the previous sections are tabulated in Table 7.1. It can be observed that the abstract algorithms and sampling techniques are strongest in the implementation category. Spectral models are general and robust, and analysis methods are available for them; they are strongest in the sound category. Physical modeling employs very intuitive parameterization, and it is strongest in the parameter category.
Table 7.1: Tabulated evaluation of the sound synthesis methods presented in this document. The columns give the ratings for the parameter criteria (intuitivity, perceptibility, physicality, behavior), the sound criteria (robustness of identity, generality, analysis), and the implementation criteria (computational cost, memory usage, control stream, latency, parallelization). The rows cover the abstract and sampling methods (FM, waveshaping, KS, sampling, multiple wavetable, granular), the spectral models (additive, phase vocoder, MQ, source-filter, SMS, TMS, FFT⁻¹, CHANT, VOSIM), and the physical models (FD methods, modal, CORDIS, waveguide, waveguide meshes, commuted waveguide). The individual ratings are the ones given in the discussion above.
8. Summary and Conclusions
In this document, several modern sound synthesis methods have been discussed. The methods were divided into four groups according to a taxonomy proposed by Smith (1991). Representative examples in each group were chosen, and a description of those methods was given. The interested reader was referred to the literature for more information on the methods.
Three methods based on abstract algorithms were chosen: FM synthesis, waveshaping synthesis, and the Karplus-Strong algorithm. Also, three methods utilizing recordings of sounds were discussed. These are sampling, multiple wavetable synthesis, and granular synthesis.
In the spectral modeling category, three traditional linear sound synthesis methods, namely additive synthesis, the phase vocoder, and source-filter synthesis, were first discussed. Second, the McAulay-Quatieri algorithm, Spectral Modeling Synthesis, Transient Modeling Synthesis, and the inverse-FFT-based additive synthesis method (FFT⁻¹ synthesis) were described. Finally, two methods for modeling the human voice, CHANT and VOSIM, were briefly addressed.
Three physical modeling methods that use numerical acoustics were investigated. First, models using finite difference methods were presented, with applications to string instruments as well as to mallet percussion instruments. Second, modal synthesis was discussed. Third, CORDIS, a system for modeling vibrating objects by mass-spring networks, was described.
Continuing in the physical modeling category, digital waveguides were discussed. Waveguide meshes, which are 2-D and 3-D models, were also presented. Extensions and a physical-modeling interpretation of the Karplus-Strong algorithm were discussed, and single delay loop (SDL) models were described. Finally, a case study of modeling the acoustic guitar using commuted waveguide synthesis was presented.
After the methods in the four categories were discussed, evaluation criteria based on those proposed by Jaffe (1995) were described. One additional criterion was added, addressing the suitability of a method for parallel processing. Each method was evaluated with a discussion concerning each evaluation criterion. The criteria were rated with a qualitative measure for each method. Finally, the ratings were tabulated in a comparable form. It was observed that abstract algorithms and sampling techniques are strongest in the implementation category. Spectral models are general and robust, and analysis methods are available for them; they are strongest in the sound category. Physical modeling algorithms employ very intuitive parameterization, and are strongest in the parameter category.
Bibliography
Adrien, J. M. 1989. Dynamic modeling of vibrating structures for sound synthesis, modal synthesis, Proceedings of the AES 7th International Conference, Audio Engineering Society, Toronto, Canada, pp. 291–300.
Adrien, J.-M. 1991. The missing link: modal synthesis, in: G. De Poli, A. Piccialli and C. Roads (eds), Representations of Musical Signals, The MIT Press, Cambridge, Massachusetts, USA, pp. 269–297.
Arfib, D. 1979. Digital synthesis of complex spectra by means of multiplication of nonlinear distorted sine waves, Journal of the Audio Engineering Society 27(10): 757–768.
Bate, J. A. 1990. The effect of modulator phase on timbres in FM synthesis, Computer Music Journal 14(3): 38–45.
Bennett, G. and Rodet, X. 1989. Synthesis of the singing voice, in: M. V. Mathews and J. R. Pierce (eds), Current Directions in Computer Music Research, The MIT Press, Cambridge, Massachusetts, chapter 4, pp. 19–44.
Borin, G. and De Poli, G. 1996. A hysteretic hammer-string interaction model for physical model synthesis, Proceedings of the Nordic Acoustical Meeting, Helsinki, Finland, pp. 399–406.
Borin, G., De Poli, G. and Rocchesso, D. 1997a. Elimination of delay-free loops in discrete-time models of nonlinear acoustic systems, Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York.
Borin, G., De Poli, G. and Sarti, A. 1997b. Musical signal synthesis, in: C. Roads, S. T. Pope, A. Piccialli and G. De Poli (eds), Musical Signal Processing, Swets & Zeitlinger, Lisse, the Netherlands, chapter 1, pp. 5–30.
Bristow-Johnson, R. 1996. Wavetable synthesis 101, a fundamental perspective, Proceedings of the 101st AES Convention, Los Angeles, California.
Cadoz, C., Luciani, A. and Florens, J. 1983. Responsive input devices and sound synthesis by simulation of instrumental mechanisms: the CORDIS system, Computer Music Journal 8(3): 60–73.
Cavaliere, S. and Piccialli, A. 1997. Granular synthesis of musical signals, in: C. Roads, S. T. Pope, A. Piccialli and G. De Poli (eds), Musical Signal Processing, Swets & Zeitlinger, Lisse, the Netherlands, chapter 5, pp. 155–186.
Chaigne, A. 1991. Viscoelastic properties of nylon guitar strings, Catgut Acoustical Society Journal 1(7): 2117.
Chaigne, A. 1992. On the use of finite differences for musical synthesis. Application to plucked stringed instruments, Journal d'Acoustique 5(2): 181–211.
Chaigne, A. and Askenfelt, A. 1994a. Numerical simulations of piano strings. I. A physical model for a struck string using finite difference methods, Journal of the Acoustical Society of America 95(2): 1112–1118.
Chaigne, A. and Askenfelt, A. 1994b. Numerical simulations of piano strings. II. Comparisons with measurements and systematic exploration of some hammer-string parameters, Journal of the Acoustical Society of America 95(3): 1631–1640.
Chaigne, A. and Doutaut, V. 1997. Numerical simulations of xylophones. I. Time-domain modeling of the vibrating bars, Journal of the Acoustical Society of America 101(1): 539–557.
Chaigne, A., Askenfelt, A. and Jansson, E. V. 1990. Temporal synthesis of string instrument tones, Quarterly Progress and Status Report, number 4, Speech Transmission Laboratory, Royal Institute of Technology (KTH), Stockholm, Sweden, pp. 81–100.
Chowning, J. M. 1973. The synthesis of complex audio spectra by means of frequency modulation, Journal of the Audio Engineering Society 21(7): 526–534. Reprinted in C. Roads and J. Strawn, eds. 1985. Foundations of Computer Music. Cambridge, Massachusetts: The MIT Press. pp. 6–29.
Cook, P. R. 1991. TBone: an interactive waveguide brass instrument synthesis workbench for the NeXT machine, Proceedings of the International Computer Music Conference, Montreal, Canada, pp. 297–299.
Cook, P. R. 1992. A meta-wind-instrument physical model, and a meta-controller for real time performance control, Proceedings of the International Computer Music Conference, International Computer Music Association, San Francisco, CA., pp. 273–276.
Cook, P. R. 1993. SPASM, a real-time vocal tract physical model controller and singer, the companion software synthesis system, Computer Music Journal 17(1): 30–44.
De Poli, G. 1983. A tutorial on digital sound synthesis techniques, Computer Music Journal 7(2): 76–87. Also published in Roads C. (ed). 1989. The Music Machine, pp. 429–447. The MIT Press. Cambridge, Massachusetts, USA.
De Poli, G. and Piccialli, A. 1991. Pitch-synchronous granular synthesis, in: G. De Poli, A. Piccialli and C. Roads (eds), Representations of Musical Signals, The MIT Press, Cambridge, Massachusetts, USA, pp. 391–412.
Delprat, N. 1997. Global frequency modulation laws extraction from the Gabor transform of a signal: a first study of the interacting components case, IEEE Transactions on Speech and Audio Processing 5(1): 64–71.
Dietz, P. H. and Amir, N. 1995. Synthesis of trumpet tones by physical modeling, Proceedings of the International Symposium on Musical Acoustics, pp. 472–477.
Dolson, M. 1986. The phase vocoder: a tutorial, Computer Music Journal 10(4): 14–27.
Dudley, H. 1939. The vocoder, Bell Laboratories Record 17: 122–126.
Eckel, G., Iovino, F. and Caussé, R. 1995. Sound synthesis by physical modelling with Modalys, Proceedings of the International Symposium on Musical Acoustics, Dourdan, France, pp. 479–482.
Evangelista, G. 1993. Pitch-synchronous wavelet representation of speech and music signals, IEEE Transactions on Signal Processing 41(12): 3312–3330.
Evangelista, G. 1994. Comb and multiplexed wavelet transforms and their applications to signal processing, IEEE Transactions on Signal Processing 42(2): 292–303.
Evangelista, G. 1997. Wavelet representations of musical signals, in: C. Roads, S. T. Pope, A. Piccialli and G. De Poli (eds), Musical Signal Processing, Swets & Zeitlinger, Lisse, the Netherlands, chapter 4, pp. 127–153.
Fitz, K. and Haken, L. 1996. Sinusoidal modeling and manipulation using Lemur, Computer Music Journal 20(4): 44–59.
Flanagan, J. L. and Golden, R. M. 1966. Phase vocoder, The Bell System Technical Journal 45: 1493–1509.
Fletcher, N. H. and Rossing, T. D. 1991. The Physics of Musical Instruments, Springer-Verlag, New York, USA, p. 620.
Florens, J.-L. and Cadoz, C. 1991. The physical model: modeling and simulating the instrumental universe, in: G. De Poli, A. Piccialli and C. Roads (eds), Representations of Musical Signals, The MIT Press, Cambridge, Massachusetts, USA, pp. 227–268.
Fontana, F. and Rocchesso, D. 1995. A new formulation of the 2D-waveguide mesh for percussion instruments, Proceedings of the XI Colloquium on Musical Informatics, Bologna, Italy, pp. 27–30.
Goodwin, M. and Gogol, A. 1995. Overlap-add synthesis of nonstationary sinusoids, Proceedings of the International Computer Music Conference, Banff, Canada, pp. 355–356.
Goodwin, M. and Rodet, X. 1994. Efficient Fourier synthesis of nonstationary sinusoids, Proceedings of the International Computer Music Conference, Aarhus, Denmark, pp. 333–334.
Goodwin, M. and Vetterli, M. 1996. Time-frequency signal models for music analysis, transformation, and synthesis, Proceedings of the 3rd IEEE Symposium on Time-Frequency and Time-Scale Analysis, Paris, France.
Gordon, J. W. and Strawn, J. 1985. An introduction to the phase vocoder, in: J. Strawn (ed.), Digital Audio Signal Processing: An Anthology, William Kaufmann, Inc., chapter 5, pp. 221–270.
Harris, F. J. 1978. On the use of windows for harmonic analysis with the discrete Fourier transform, Proceedings of the IEEE 66(1): 51–83.
Hiller, L. and Ruiz, P. 1971a. Synthesizing musical sounds by solving the wave equation for vibrating objects: part 1, Journal of the Audio Engineering Society 19(6): 462–470.
Hiller, L. and Ruiz, P. 1971b. Synthesizing musical sounds by solving the wave equation for vibrating objects: part 2, Journal of the Audio Engineering Society 19(7): 542–550.
Hirschman, S. E. 1991. Digital Waveguide Modeling and Simulation of Reed Woodwind Instruments, Technical Report STAN-M-72, Stanford University, Dept. of Music, Stanford, California.
Holm, F. 1992. Understanding FM implementations: a call for common standards, Computer Music Journal 16(1): 34–42.
Horner, A., Beauchamp, J. and Haken, L. 1993. Methods for multiple wavetable synthesis of musical instrument tones, Journal of the Audio Engineering Society 41(5): 336–356.
Huopaniemi, J., Karjalainen, M., Välimäki, V. and Huotilainen, T. 1994. Virtual instruments in virtual rooms – a real-time binaural room simulation environment for physical modeling of musical instruments, Proceedings of the International Computer Music Conference, Aarhus, Denmark, pp. 455–462.
Jaffe, D. A. 1995. Ten criteria for evaluating synthesis techniques, Computer Music Journal 19(1): 76–87.
Jaffe, D. A. and Smith, J. O. 1983. Extensions of the Karplus-Strong plucked-string algorithm, Computer Music Journal 7(2): 56–69. Also published in Roads C. (ed). 1989. The Music Machine, pp. 481–494. The MIT Press. Cambridge, Massachusetts, USA.
Kaegi, W. and Tempelaars, S. 1978. VOSIM – a new sound synthesis system, Journal of the Audio Engineering Society 26(6): 418–426.
Karjalainen, M. and Laine, U. K. 1991. A model for real-time sound synthesis of guitar on a floating-point signal processor, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 5, Toronto, Canada, pp. 3653–3656.
Karjalainen, M. and Smith, J. O. 1996. Body modeling techniques for string instrument synthesis, Proceedings of the International Computer Music Conference, Hong Kong, pp. 232–239.
Karjalainen, M., Huopaniemi, J. and Välimäki, V. 1995. Direction-dependent physical modeling of musical instruments, Proceedings of the International Congress on Acoustics, Vol. 3, Trondheim, Norway, pp. 451–454.
Karjalainen, M., Laine, U. K. and Välimäki, V. 1991. Aspects in modeling and real-time synthesis of the acoustic guitar, Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, USA.
Karjalainen, M., Välimäki, V. and Jánosy, Z. 1993. Towards high-quality sound synthesis of the guitar and string instruments, Proceedings of the International Computer Music Conference, Tokyo, Japan, pp. 56–63.
Karjalainen, M., Välimäki, V. and Tolonen, T. 1998. Plucked string models – from the Karplus-Strong algorithm to digital waveguides and beyond, Accepted for publication in Computer Music Journal.
Karplus, K. and Strong, A. 1983. Digital synthesis of plucked-string and drum timbres, Computer Music Journal 7(2): 43–55. Also published in Roads C. (ed). 1989. The Music Machine, pp. 467–479. The MIT Press. Cambridge, Massachusetts.
Kurz, M. and Feiten, B. 1996. Physical modelling of a stiff string by numerical integration, Proceedings of the International Computer Music Conference, Hong Kong, pp. 361–364.
Laakso, T. I., Välimäki, V., Karjalainen, M. and Laine, U. K. 1996. Splitting the unit delay – tools for fractional delay filter design, IEEE Signal Processing Magazine 13(1): 30–60.
Lang, M. and Laakso, T. I. 1994. Simple and robust method for the design of allpass filters using least-squares phase error criterion, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing 41(1): 40–48.
Laroche, J. and Dolson, M. 1997. About this phasiness business, Proceedings of the International Computer Music Conference, Thessaloniki, Greece, pp. 55–58.
Laroche, J. and Jot, J.-M. 1992. Analysis/synthesis of quasi-harmonic sound by use of the Karplus-Strong algorithm, Proceedings of the 2nd French Congress on Acoustics, Arcachon, France.
Le Brun, M. 1979. Digital waveshaping synthesis, Journal of the Audio Engineering Society 27(4): 250–266.
Makhoul, J. 1975. Linear prediction: a tutorial review, Proceedings of the IEEE 63: 561–580.
McAulay, R. J. and Quatieri, T. F. 1986. Speech analysis/synthesis based on a sinusoidal representation, IEEE Transactions on Acoustics, Speech, and Signal Processing 34(6): 744–754.
Moore, F. R. 1990. Elements of Computer Music, Prentice Hall, Englewood Cliffs, New Jersey.
Moorer, J. A. 1978. The use of the phase vocoder in computer music applications, Journal of the Audio Engineering Society 26(1/2): 42–45.
Moorer, J. A. 1979. The use of linear prediction of speech in computer music applications, Journal of the Audio Engineering Society 27(3): 134–140.
Moorer, J. A. 1985. Signal processing aspects of computer music: a survey, in: J. Strawn (ed.), Digital Audio Signal Processing: An Anthology, William Kaufmann, Inc., chapter 5, pp. 149–220.
Morrison, J. and Adrien, J. 1993. MOSAIC: a framework for modal synthesis, Computer Music Journal 17(1): 45–56.
Morse, P. M. and Ingard, U. K. 1968. Theoretical Acoustics, Princeton University
Press, Princeton, New Jersey, USA.
Msallam, R., Dequidt, S., Tassart, S. and Caussé, R. 1997. Physical model of the trombone including nonlinear propagation effects, Proceedings of the Institute of Acoustics, Vol. 19, pp. 245–250. Presented at the International Symposium on Musical Acoustics, Edinburgh, UK.
Nuttall, A. H. 1981. Some windows with very good sidelobe behavior, IEEE Transactions on Acoustics, Speech, and Signal Processing 29(1): 84–91.
Oppenheim, A. V., Willsky, A. S. and Young, I. T. 1983. Signals and Systems,
Prentice-Hall, New Jersey, USA, p. 796.
Paladin, A. and Rocchesso, D. 1992. A dispersive resonator in real-time on MARS workstation, Proceedings of the International Computer Music Conference, San Jose, California, USA, pp. 146–149.
Portnoff, M. R. 1976. Implementation of the digital phase vocoder using the fast Fourier transform, IEEE Transactions on Acoustics, Speech, and Signal Processing 24(3): 243–248.
Rank, E. and Kubin, G. 1997. A waveguide model for slapbass synthesis, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, Munich, Germany, pp. 443–446.
Risset, J.-C. 1985. Computer music experiments 1964–…, Computer Music Journal 9(1): 67–74. Also published in Roads C. (ed). 1989. The Music Machine, pp. 67–74. The MIT Press. Cambridge, Massachusetts, USA.
Roads, C. 1991. Asynchronous granular synthesis, in: G. D. Poli, A. Piccialli and C. Roads (eds), Representations of Musical Signals, The MIT Press, Cambridge, Massachusetts, USA, pp. 143–185.
Roads, C. 1995. The Computer Music Tutorial, The MIT Press, Cambridge, Massachusetts, USA, p. 1234.
Rocchesso, D. 1998. Personal communication.
Rocchesso, D. and Scalcon, F. 1996. Accurate dispersion simulation for piano strings, Proceedings of the Nordic Acoustical Meeting, Helsinki, Finland, pp. 407–414.
Rocchesso, D. and Turra, F. 1993. A generalized excitation for real-time sound synthesis by physical models, Proceedings of the Stockholm Music Acoustics Conference, Stockholm, Sweden, pp. 584–588.
Rodet, X. 1980. Time-domain formant-wave-function synthesis, Computer Music Journal 8(3): 9–14.
Rodet, X. and Depalle, P. 1992a. A new additive synthesis method using inverse Fourier transform and spectral envelopes, in: A. Strange (ed.), Proceedings of the International Computer Music Conference, pp. 410–411.
Rodet, X. and Depalle, P. 1992b. Spectral envelopes and inverse FFT synthesis, Proceedings of the 93rd AES Convention, San Francisco, California.
Rodet, X., Potard, Y. and Barrière, J.-B. 1984. The CHANT project: from synthesis of the singing voice to synthesis in general, Computer Music Journal 8(3): 15–31.
Savioja, L. and Välimäki, V. 1996. The bilinearly deinterpolated waveguide mesh, Proceedings of the 1996 IEEE Nordic Signal Processing Symposium, Espoo, Finland, pp. 443–446.
Savioja, L. and Välimäki, V. 1997. Improved discrete-time modeling of multidimensional wave propagation using the interpolated digital waveguide mesh, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, Munich, Germany.
Serra, M.-H. 1997a. Introducing the phase vocoder, in: C. Roads, S. T. Pope, A. Piccialli and G. De Poli (eds), Musical Signal Processing, Swets & Zeitlinger, Lisse, the Netherlands, chapter 2, pp. 31–90.
Serra, X. 1989. A System for Sound Analysis/Transformation/Synthesis Based on
a Deterministic plus Stochastic Decomposition, PhD thesis, Stanford University,
California, USA, p. 151.
Serra, X. 1997b. Musical sound modeling with sinusoids plus noise, in: C. Roads, S. T. Pope, A. Piccialli and G. De Poli (eds), Musical Signal Processing, Swets & Zeitlinger, Lisse, the Netherlands, chapter 3, pp. 91–122.
Serra, X. and Smith, J. O. 1990. Spectral modeling synthesis: a sound analysis/synthesis system based on a deterministic plus stochastic decomposition, Computer Music Journal 14(4): 12–24.
Smith, J. O. 1983. Techniques for Digital Filter Design and System Identification with Application to the Violin, PhD thesis, Stanford University, California, USA, p. 260.
Smith, J. O. 1986. Efficient simulation of the reed-bore and bow-string mechanisms, Proceedings of the International Computer Music Conference, The Hague, the Netherlands, pp. 275–280.
Smith, J. O. 1987. Music applications of digital waveguides, Technical Report STAN-M-39, CCRMA, Dept. of Music, Stanford University, California, USA, p. 181.
Smith, J. O. 1991. Viewpoints on the history of digital synthesis, Proceedings of the International Computer Music Conference, Montreal, Canada, pp. 1–10.
Smith, J. O. 1992. Physical modeling using digital waveguides, Computer Music Journal 16(4): 74–91.
Smith, J. O. 1993. Efficient synthesis of stringed musical instruments, Proceedings of the International Computer Music Conference, Tokyo, Japan, pp. 64–71.
Smith, J. O. 1995. Introduction to digital waveguide modeling of musical instruments, Unpublished manuscript.
Smith, J. O. 1996. Physical modeling synthesis update, Computer Music Journal 20(2): 44–56.
Smith, J. O. 1997. Acoustic modeling using digital waveguides, in: C. Roads, S. T. Pope, A. Piccialli and G. De Poli (eds), Musical Signal Processing, Swets & Zeitlinger, Lisse, the Netherlands, chapter 7, pp. 221–264.
Smith, J. O. and Serra, X. 1987. PARSHL: an analysis/synthesis program for non-harmonic sounds based on a sinusoidal representation, Proceedings of the International Computer Music Conference, Urbana-Champaign, Illinois, USA, pp. 290–297.
Smith, J. O. and Van Duyne, S. A. 1995. Commuted piano synthesis, Proceedings of the International Computer Music Conference, Banff, Canada, pp. 335–342.
Stilson, T. and Smith, J. 1996. Alias-free digital synthesis of classical analog waveforms, Proceedings of the International Computer Music Conference, Hong Kong, pp. 332–335.
Strawn, J. 1980. Approximation and syntactic analysis of amplitude and frequency functions for digital sound synthesis, Computer Music Journal 4(3): 3–24.
Sullivan, C. S. 1990. Extending the Karplus-Strong algorithm to synthesize electric guitar timbres with distortion and feedback, Computer Music Journal 14(3): 26–37.
Tolonen, T. 1998. Model-based Analysis and Resynthesis of Acoustic Guitar Tones,
Master's thesis, Helsinki University of Technology, Espoo, Finland, p. 102. Report
46, Laboratory of Acoustics and Audio Signal Processing.
Tolonen, T. and Välimäki, V. 1997. Automated parameter extraction for plucked string synthesis, Proceedings of the Institute of Acoustics, Vol. 19, pp. 245–250. Presented at the International Symposium on Musical Acoustics, Edinburgh, UK.
Tomisawa, N. 1981. Tone production method for an electronic musical instrument,
U.S. Patent 4,249,447.
Truax, B. 1977. Organizational techniques for c:m ratios in frequency modulation, Computer Music Journal 1(4): 39–45. Reprinted in C. Roads and J. Strawn, eds. 1985. Foundations of Computer Music. Cambridge, Massachusetts: The MIT Press. pp. 68–82.
Van Duyne, S. A. and Smith, J. O. 1993a. The 2-D digital waveguide, Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, USA.
Van Duyne, S. A. and Smith, J. O. 1993b. Physical modeling with the 2-D digital waveguide mesh, Proceedings of the International Computer Music Conference, pp. 40–47.
Van Duyne, S. A. and Smith, J. O. 1994. A simplified approach to modeling dispersion caused by stiffness in strings and plates, Proceedings of the International Computer Music Conference, Aarhus, Denmark, pp. 407–410.
Van Duyne, S. A. and Smith, J. O. 1995a. Developments for the commuted piano, Proceedings of the International Computer Music Conference, Banff, Canada, pp. 319–326.
Van Duyne, S. A. and Smith, J. O. 1995b. The tetrahedral digital waveguide mesh, Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York.
Van Duyne, S. A. and Smith, J. O. 1996. The 3D tetrahedral digital waveguide mesh with musical applications, Proceedings of the International Computer Music Conference, International Computer Music Association, Hong Kong, pp. 9–16.
Vergez, C. and Rodet, X. 1997. Comparison of real trumpet playing, latex model of lips and computer model, Proceedings of the International Computer Music Conference, Thessaloniki, Greece, pp. 180–187.
Verma, T. S., Levine, S. N. and Meng, T. H. Y. 1997. Transient modeling synthesis: a flexible analysis/synthesis tool for transient signals, Proceedings of the International Computer Music Conference, Thessaloniki, Greece, pp. 164–167.
Välimäki, V. 1995. Discrete-Time Modeling of Acoustic Tubes Using Fractional Delay Filters, PhD thesis, Helsinki University of Technology, Espoo, Finland, p. 193.
Välimäki, V. and Takala, T. 1996. Virtual musical instruments – natural sound using physical models, Organised Sound 1(2): 75–86.
Välimäki, V. and Tolonen, T. 1997a. Development and calibration of a guitar synthesizer, Presented at the 103rd Convention of the Audio Engineering Society, Preprint 4594, New York, USA.
Välimäki, V. and Tolonen, T. 1997b. Multirate extensions for model-based synthesis of plucked string instruments, Proceedings of the International Computer Music Conference, Thessaloniki, Greece, pp. 244–247.
Välimäki, V., Huopaniemi, J., Karjalainen, M. and Jánosy, Z. 1996. Physical modeling of plucked string instruments with application to real-time sound synthesis, Journal of the Audio Engineering Society 44(5): 331–353.
Välimäki, V., Karjalainen, M. and Laakso, T. I. 1993. Modeling of woodwind bores with finger holes, Proceedings of the International Computer Music Conference, pp. 32–39.
Välimäki, V., Karjalainen, M., Jánosy, Z. and Laine, U. K. 1992a. A real-time DSP implementation of a flute model, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, San Francisco, California, pp. 249–252.
Välimäki, V., Laakso, T. I. and Mackenzie, J. 1995. Elimination of transients in time-varying allpass fractional delay filters with application to digital waveguide modeling, Proceedings of the International Computer Music Conference, Banff, Canada, pp. 303–306.
Välimäki, V., Laakso, T. I., Karjalainen, M. and Laine, U. K. 1992b. A new computational model for the clarinet, in: A. Strange (ed.), Proceedings of the International Computer Music Conference, International Computer Music Association, San Francisco, CA.