Distributed Digital Signal Processing
System using Single Board Computers
Jussi Nieminen
School of Electrical Engineering
Thesis submitted for examination for the degree of Master of
Science in Technology.
Espoo 3.8.2016
Thesis supervisor:
Prof. Tapio Lokki
Thesis advisors:
D.Sc. (Tech.) Jukka Pätynen
D.Sc. (Tech.) Sakari Tervo
Aalto University
School of Electrical Engineering
Abstract of the Master's Thesis
Author: Jussi Nieminen
Title: Distributed Digital Signal Processing System using Single Board
Computers
Date: 3.8.2016
Language: English
Number of pages: 10+82
Department of Signal Processing and Acoustics
Professorship: Acoustics and audio signal processing
Supervisor: Prof. Tapio Lokki
Advisors: D.Sc. (Tech.) Jukka Pätynen, D.Sc. (Tech.) Sakari Tervo
Real-time multi-channel convolution using long impulse responses is currently
only achievable on expensive, dedicated equipment. The thesis implements a
low-latency multi-channel convolution system that distributes the processing to
several low-cost single board computers. The system is open source software based
on Linux, Pure Data, FFTW and audio over IP.
Additionally, the thesis explores the applicability of modern single board computers
in real-time signal processing. We also measure the latency and the audio quality
attributes of the implemented system. The results show that the Raspberry Pi 2
Model B equipped with the HiFiBerry DAC+ expansion board is still infeasible for
demanding virtual acoustics applications. However, the implementation achieves
stereo convolution with up to 1 s impulse responses at perceptually low latency.
Furthermore, the audio quality of the HiFiBerry DAC+ is confirmed to be adequate
for professional audio applications.
Keywords: digital signal processing, real-time convolution, single board computers, audio over IP, audio distortion measurements
Aalto University
School of Electrical Engineering
Abstract of the Master's Thesis (original in Finnish)

Author: Jussi Nieminen
Title: Distributed Digital Signal Processing System using Single Board Computers
Date: 3.8.2016
Language: English
Number of pages: 10+82
Department of Signal Processing and Acoustics
Professorship: Acoustics and audio signal processing
Supervisor: Prof. Tapio Lokki
Advisors: D.Sc. (Tech.) Jukka Pätynen, D.Sc. (Tech.) Sakari Tervo

Real-time multi-channel convolution using long impulse responses is currently feasible mainly with expensive special-purpose hardware. This thesis implements a multi-channel convolution system that distributes the processing to several affordable single board computers. The system is based on open source software and is implemented for the Linux operating system. The software is built on the Pure Data platform, utilizing open source FFT and audio over IP libraries.

In addition, the thesis examines the suitability of single board computers for real-time signal processing. The latency and the key audio quality attributes of the implemented system are also measured. The results show that the Raspberry Pi 2 Model B with the HiFiBerry DAC+ audio board does not reach sufficient performance for the proposed virtual acoustics application. The system nevertheless performs two-channel convolution in real-time when one-second-long impulse responses are used. The audio quality of the HiFiBerry DAC+ is also found to be adequate for many audio processing applications.

Keywords: digital signal processing, real-time convolution, single board computers, audio over IP, audio quality measurement
Preface
I wish to thank Professor Tapio Lokki for supervision and the opportunity to use
the facilities and equipment of the virtual acoustics research group. I also wish
to thank my instructors Jukka Pätynen and Sakari Tervo for the topic idea, the
excellent guidance on the subject and the numerous pointers for good academic
writing techniques.
Additionally, I would like to thank the Foundation for Aalto University Science
and Technology for supporting the thesis with a grant. Without it, working on the
thesis would have been difficult.
Finally, I wish to thank the fellow students in my class for peer support and
friends and family for being there and giving the chance to unwind after hard days.
Last but not least, I wish to give special thanks to Kim for pushing me forward and
being that special someone.
Otaniemi, 3.8.2016
Jussi O. Nieminen
Contents

Abstract
Abstract (in Finnish)
Preface
Contents
Symbols, operators and abbreviations

1 Introduction
  1.1 Motivation
  1.2 The proposed implementation
  1.3 How to read this thesis

2 Background
  2.1 Digital signals
    2.1.1 Sampling
    2.1.2 Quantization
  2.2 The discrete-time Fourier transform
    2.2.1 The discrete Fourier transform
    2.2.2 The fast Fourier transform
  2.3 Discrete-time convolution
    2.3.1 Linear convolution
    2.3.2 Circular convolution
    2.3.3 Fast convolution
  2.4 Real-time convolution methods
    2.4.1 Signal segmentation
    2.4.2 Overlap-add method
    2.4.3 Overlap-save method
  2.5 Networking concepts
    2.5.1 Computer networks
    2.5.2 Connection-oriented and connectionless communication
    2.5.3 Network protocols
  2.6 OSI Model
  2.7 TCP/IP protocol stack
    2.7.1 Application layer
    2.7.2 Transport layer
    2.7.3 Internet layer
    2.7.4 Network access layer
  2.8 IP addressing
    2.8.1 IP address classes
    2.8.2 Internet-unique and private IP addresses
    2.8.3 Classless addressing
    2.8.4 Broadcast address
  2.9 Audio over IP

3 Research material and methods
  3.1 Single board computers
    3.1.1 CPU considerations
    3.1.2 RAM requirements
    3.1.3 Comparison
    3.1.4 Raspberry Pi 2 Model B and HiFiBerry
  3.2 Pure Data
    3.2.1 Graphical programming interface
    3.2.2 Programming external objects for Pure Data
    3.2.3 Compiling
  3.3 FFT library
  3.4 Audio over IP

4 Implementation
  4.1 Audio over IP externals
  4.2 The real-time convolution external
    4.2.1 Storing the input blocks
    4.2.2 Convolution algorithm
  4.3 Host Pure Data patch
  4.4 The receiving Pure Data patch
  4.5 Code optimization for ARM processors
    4.5.1 Enabling the Advanced SIMD extension
    4.5.2 Cache optimization
    4.5.3 Data alignment
    4.5.4 Loop termination
  4.6 Multithreading
  4.7 Raspbian configuration

5 Audio distortion measurements
  5.1 Audio distortion
    5.1.1 Linear and non-linear distortion
    5.1.2 Harmonic distortion
    5.1.3 Intermodulation distortion
  5.2 Logarithmic sine sweep method
    5.2.1 Generating logarithmic sine sweeps
    5.2.2 Pre-processing
    5.2.3 Post-processing

6 Results
  6.1 Measurement set-up
  6.2 Linear responses of the DACs
  6.3 Harmonic distortion
  6.4 Intermodulation distortion
  6.5 Latency
    6.5.1 Latency-inducing system components
    6.5.2 Network latency
    6.5.3 Latencies caused by buffering
    6.5.4 Overall latency
  6.6 Code profiling

7 Summary
  7.1 Conclusions
  7.2 Applications
  7.3 Future research and development

References

A Build instructions
  A.1 Makefile for bcast and bcreceive
  A.2 Makefile for conv
  A.3 benchfftw
  A.4 fftw3
  A.5 Oprofile

B Raspbian configuration
  B.1 HiFiBerry DAC+
  B.2 Setting up a static IP address
  B.3 Overclocking
  B.4 CPU frequency scaling
  B.5 Enabling the real-time scheduler
Symbols, operators and abbreviations
Symbols
∆t          time difference
e           Euler's number
fN, fs      Nyquist frequency, sampling frequency
j           imaginary number
k           frequency index
m           input block index
n           sample index
t           time
ω           normalized frequency
pe, prw     end pointer, read/write pointer in C
φ           phase offset
q           impulse response block index
r           window length parameter
A           amplitude
DdB         dynamic range
THD         total harmonic distortion
B           block size
C           number of multiplications
K           number of harmonics
L           length of convolution result, number of blocks
M           impulse response length, amount of memory
N           sequence length, distortion order
S           audio buffer size
T           sweep length
f(t)        reference sweep
h[n]        transfer function in the discrete-time domain
hq[n]       impulse response block
h(t)        transfer function in the time domain
sm[n]       overlapping block segment
x[n]        input signal in the discrete-time domain
xm[n]       input signal block
x(t)        input signal in the time domain
y[n]        output signal in the discrete-time domain
ym[n]       output signal block
y(t)        output signal in the time domain
H[k]        transfer function in the frequency domain
SH, SX      C arrays for storing audio samples
W_N         Nth root of unity
X[k]        input signal in the frequency domain
Y[k]        output signal in the frequency domain
Z           result of complex multiplication
Operators

ln              natural logarithm
log_n           base-n logarithm
∗               linear convolution
~               circular convolution
Σ_{i=a}^{b}     sum over index i in the interval [a, b]
d/dt            derivative with respect to variable t
∫_a^b f(x)dx    definite integral of f(x) over the interval [a, b]
⌈x⌉             ceil
⌊x⌋             floor
⟨x⟩N            residue of x mod N
FFT{x[n]}       fast Fourier transform of sequence x[n]
IFFT{X[k]}      inverse fast Fourier transform of sequence X[k]
Abbreviations

ADC        analogue to digital converter
AES3       audio engineering society 3
AoIP       audio over IP
API        application programming interface
ARPANET    advanced research project agency network
CPU        central processing unit
DAC        digital to analogue converter
DFT        discrete Fourier transform
DSP        digital signal processor/processing
DTFT       discrete-time Fourier transform
EQ         equalizer
FFT        fast Fourier transform
FFTW       fastest Fourier transform in the west
FIR        finite impulse response
FTP        file transfer protocol
GCC        GNU compiler collection
GPU        graphics processing unit
GUI        graphical user interface
HTTP       hypertext transfer protocol
HRTF       head related transfer function
I/O        input/output
IMD        intermodulation distortion
IP         Internet protocol
IPv4       Internet protocol version 4
IPv6       Internet protocol version 6
ISP        Internet service provider
LAN        local area network
MFLOPS     mega floating point operations per second
NEON™      advanced SIMD extension
OSI        open systems interconnection model
PC         personal computer
RAM        random access memory
RIR        room impulse response
S/PDIF     Sony/Philips digital interface format
SBC        single board computer
SIMD       single instruction multiple data
SMTP       simple mail transfer protocol
SNR        signal-to-noise ratio
TCP        transmission control protocol
THD        total harmonic distortion
TTL        time to live
UDP        user datagram protocol
VoIP       voice over IP
VST        virtual studio technology
WAN        wide area network
WLAN       wireless local area network
1 Introduction
Digital signal processing (DSP) plays an important role in modern electronics.
Analogue circuits have been gradually replaced by digital counterparts and devices
are increasingly equipped for measuring data from the environment. The DSP chip market
is growing, introducing more cost-efficient integrated circuits and microprocessors for
DSP system designers. This thesis explores a cost-effective approach to implement a
DSP system in the context of audio signal processing.
Digital signal processing can be split into two categories: non-real-time and
real-time DSP. Non-real-time DSP analyses or manipulates data sets of arbitrary
size and can take an arbitrary time to process the data. Real-time DSP, however,
delivers results instantaneously or within such a time that a human perceives them as
instantaneous. For example, a digital musical instrument has to generate a tone
when a key is pressed and a biomedical sensor has to output information about
the patient’s condition in real-time. The real-time property restricts the available
processing time, creating additional requirements for algorithm and system design.
This thesis focuses on real-time DSP exclusively in the context of audio applications,
referred to as real-time audio signal processing.
Real-time audio signal processing has advanced remarkably in the past couple
of decades. The performance of processing units has increased, chip sizes have
shrunk and the cost of integrated circuits has decreased. The evolution in processor
technology brought audio synthesizers and effects to home computers in the late 1990s.
The general purpose processors at the time were able to perform signal processing
tasks with feasible latencies, i.e. in real-time.
At the same time, audio engineers have developed more complex real-time audio
applications. Modern applications such as accurate models of vintage analogue
instruments and amplifiers, realistic spatial reverberation effects, and room correction
equalizers typically require filters of very high order. Therefore, a considerable
amount of computational complexity is needed, even more so when these effects
are implemented for multiple channels, for example, in surround- or 3D-sound
applications.
1.1 Motivation
3D-sound reproduction techniques are being increasingly researched due to the
current prospects for virtual and augmented reality applications. A typical 3D-sound application reproduces the acoustics of a natural space, such as a concert
hall, in headphone or loudspeaker listening. Generally, an acoustic reproduction
requires spatial impulse response measurements recorded in the desired space using
a microphone array. However, due to the constraints of the current microphone
technology, additional processing methods are required in order to enhance the
directional accuracy of the spatial impression [1, 2].
These techniques not only require a considerable amount of processing in itself,
but also a substantial amount of convolutions to produce an output signal. Consider
listening to two musicians playing at different locations on the stage of
a concert hall. With an array of 6 directional microphones, the acoustics of the
performance could be reproduced by combining the 6-channel impulse responses from
both source locations, thus requiring 12 convolutions to produce an output in a
6-speaker listening setup.
Implementing the processing in real-time would enable the use of arbitrary input
signals and the instant switching between source and receiver locations. In addition,
different impulse response sets could be loaded in real-time, as opposed to rendering
them offline with dedicated input signals.
The aforementioned processing task can be implemented using professional DSP
hardware. However, the cost of such equipment is currently very high for a common
consumer. One cost effective alternative would be to use graphics processing units
(GPU). Using GPUs for parallel audio processing tasks has been studied recently
and proven to be a rather efficient option [3, 4]. However, GPUs are still fairly
costly, even though they are cheaper than professional audio gear. Additionally,
they require a motherboard, increasing both the size and cost of the implementation.
Finally, knowledge of a programming framework like Open Computing Language
(OpenCL) [5] or Compute Unified Device Architecture (CUDA) [6] is required in
order to implement general purpose computing on GPUs.
1.2 The proposed implementation
This thesis explores an alternative approach in which the processing of multiple channels
is distributed to several small, low power and low-cost single board computers (SBC)
over a local area network (LAN). Very recently, several cheap and portable SBCs
have become available on the market. The cost of these computers is typically a
fraction of the cost of a DSP processor. The operating system of SBCs is often
Linux, which is practical for C and C++ development. Linux is also compatible
with popular open source real-time audio application programming interfaces (API)
and environments including JACK Audio Connection Kit [7], Jules’ Utility Class
Extensions (JUCE) [8] and Pure Data [9].
In this thesis, we implement a system consisting of a number of single board
computers controlled by a host computer via LAN. Figure 1 visualizes the proposed
system. The network switch links the system components together and synchronizes
the outgoing broadcast data to arrive at approximately the same time at each SBC.
We also program the audio signal processing software for the single board computers. The software convolves input audio signals with long impulse responses.
The impulse responses are streamed from a host computer to the SBCs and can be
changed instantaneously for individual channels on the host. In other words, the
remote SBCs act merely as processing units, and impulse responses are not required
to be stored in the SBCs' long-term memory.
This way, when updating the impulse response sets only the host computer
requires the new files, as opposed to copying them to each SBC. Moreover, the
host computer is more likely to have the storage capacity for an extensive impulse
response database as most SBCs only support microSD cards for internal storage.
Figure 1: The distributed DSP system. Solid lines denote Ethernet cables and dotted
lines denote analogue audio cables.
1.3 How to read this thesis
The thesis is structured as follows. Chapter 2 explains the key theoretical background
for the system, including an overview of digital signals, the discrete Fourier transform,
the fast Fourier transform algorithm and real-time convolution algorithms.
Chapter 3 discusses the methods and tools used in the implementation. We
compare a set of single board computers in the lower price range, describe the Pure
Data environment and the fast Fourier transform C library used in the implementation.
Chapter 4 then describes how the system is implemented by discussing the Pure
Data patches, the convolution algorithm and code optimization methods.
Chapter 5 describes the commonly used audio distortion attributes of a digital-to-analogue converter and explains the measurement methods used. Chapter 6 discusses the
audio distortion measurement results. Additionally, the system latency measurement
is discussed and a code profiling report is analyzed. Lastly, Chapter 7 summarizes
the thesis with future ideas and proposals for alternative applications.
Figure 2: a) A discrete-time and b) a digital representation (stems) of a continuous-time sine wave (dotted).
2 Background
Programming real-time audio signal processing software is a multidisciplinary task.
Understanding of digital signals and systems is essential but knowledge of algorithm
analysis, computer architecture, and skills in software development are also necessary.
In this chapter we discuss the theory behind the digital representation of audio
signals, the discrete Fourier transform and discrete-time convolution. We also cover
the key fundamental networking concepts and give a brief introduction to audio over
IP.
2.1 Digital signals
Signals can be classified into three categories: continuous-time signals, discrete-time
signals and digital signals [10, p. 1]. Continuous-time signals are commonly also
referred to as analogue signals. Signals in nature are analogue by definition; namely,
they are defined continuously in time and have infinite resolution in amplitude values.
On the contrary, discrete-time signals are only defined at particular time instances.
Therefore, the amplitude values can be represented as a sequence of numbers. Digital
signals are discrete in both time and amplitude, enabling them to be stored and
processed by computers and digital hardware.
Figure 2 illustrates these three types of signal classes. In both graphs the dotted
sine wave represents a continuous-time signal. It has an amplitude value at every
point in time. In Fig. 2a, a discrete-time signal takes values of the sine wave function
at evenly spaced time instants. In Fig. 2b, a digital signal takes discrete values on
both time and amplitude axes. To emphasize the discretization of the amplitude
values, the amplitude resolution of the signal is chosen to be 3 bits. As Fig. 2b
shows, the amplitude is rounded to eight different values resulting in an incorrect
representation of the analogue wave. However, the values can be efficiently stored in
binary numbers using only three bits per value as log2 8 = 3.
Digital signals are often converted from analogue signals using a process called
the analogue-to-digital conversion. Microphones and loudspeakers require analogue
electronics in order to transform sound waves to electronic signals and vice versa.
Figure 3: Ideal sampler (a sample-and-hold stage followed by a quantizer). Adapted from [10, p. 4]
Additionally, analogue audio connectors are still the norm in consumer audio hardware.
Therefore, it is necessary for DSP-hardware to be able to convert the input signal
to digital and the output signal back to analogue before and after processing. The
circuits responsible for the conversions are referred to as an analogue-to-digital
converter (ADC) and a digital-to-analogue converter (DAC).
2.1.1 Sampling
The analogue-to-digital conversion consists of two processes known as sampling and
quantizing, as illustrated in Fig. 3. A sampler can be considered as a switch that
lets in a signal x(t) only at uniform time intervals Ts . The basic sampling function
can be implemented with a “sample-and-hold” circuit that maintains the signal level
until the next sampling instant. A quantizer approximates the signal level of g[nTs ]
and assigns a binary number to represent its value producing the digital signal y[n].
[10, p. 3]
Sampling a continuous-time signal x(t) results in a discrete-time sequence x[n].
The time variable n of the discrete-time sequence is an integer number. It is linked
to the time variable t of the continuous-time signal only at discrete-time instants tn ,
given by
tn = nTs = n/fs    (1)
where Ts is the sampling interval and fs = 1/Ts is the sampling frequency or
sample rate. Again, we take a sine wave as an example but this time show the
relationship between a continuous-time sine wave signal and a corresponding discrete-time sequence. Given a continuous-time signal
x(t) = A sin(2πf t + φ),
(2)
where A denotes amplitude, f denotes frequency and φ denotes phase offset, the
corresponding discrete-time signal is given by substituting Eq. (1) for t [11, p. 60].
x[n] = A sin(2π(f/fs)n + φ)    (3)
As can be seen, the frequency of digital signals is always proportional to the sampling
frequency of the system. Sample rate also determines the spectral bandwidth of
digital systems. The sampling theorem by Shannon [12] states that a signal with a
Figure 4: Aliasing in the time domain. The 3 Hz and the 1 Hz sine waves have the
same amplitude values at the sampling instants.
Figure 5: Aliasing in the frequency domain. In a) the signal is band limited below
the Nyquist frequency. In b) the frequency band exceeds the Nyquist limit.
maximum frequency of fmax must be sampled at a minimum rate of fs = 2fmax in
order to correctly represent the signal. The maximum frequency fmax is commonly
referred to as the Nyquist frequency and expressed as
fN ≤ fs/2    (4)
According to the sampling theorem, frequency components higher than fN fold back
to the frequency band 0..fN . This phenomenon is referred to as aliasing and is
considered as a distortion artefact in digital systems. Figure 4 shows a simplified
example of aliasing, where a 3 Hz sine wave is sampled at a rate of 4 Hz. Thus, the
Nyquist frequency is 2 Hz and the 3 Hz sine wave folds back producing the additional
1 Hz sine wave. Aliasing occurs because the 3 Hz and 1 Hz sine waves produce
identical amplitude values when sampled at 4 Hz sampling rate. As the sampling
instances are too sparse to recreate the 3 Hz tone, the 1 Hz tone is produced instead.
Aliasing can be further visualized in the frequency domain. Figure 5a shows
a signal with frequency content below the Nyquist frequency. However, Fig. 5b
visualizes another signal with a frequency band that exceeds the Nyquist limit. The
overlapping frequency bands visualize aliasing.
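To make Eq. (3) concrete, the following C sketch samples a sine wave at a given sample rate. The function and parameter names are illustrative only; the 3 Hz tone and 4 Hz sample rate simply reproduce the aliasing condition discussed above.

    #include <math.h>
    #include <stdio.h>

    /* Generate N samples of A*sin(2*pi*(f/fs)*n + phi), cf. Eq. (3). */
    static void sample_sine(float *x, int N, float A, float f, float fs, float phi)
    {
        const float pi = 3.14159265358979f;
        for (int n = 0; n < N; n++)
            x[n] = A * sinf(2.0f * pi * (f / fs) * n + phi);
    }

    int main(void)
    {
        float x[8];
        /* 3 Hz sampled at 4 Hz exceeds the 2 Hz Nyquist limit and aliases
           into the 0..2 Hz band. */
        sample_sine(x, 8, 1.0f, 3.0f, 4.0f, 0.0f);
        for (int n = 0; n < 8; n++)
            printf("%d %f\n", n, x[n]);
        return 0;
    }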
2.1.2 Quantization
The second phase of analogue to digital conversion involves quantization of the
sampled signal values. The quantization process rounds each sample value to the
nearest quantization level. The number of quantization levels, 2^b, depends on the
number of bits b that represent the sample values. The number of quantization levels
is more commonly referred to as the dynamic resolution of a digital signal and is
typically expressed in decibels.
DdB = 20 log10(2^b)
(5)
For example, the CD audio format which uses 16-bits per sample has a dynamic
resolution of approximately 96 dB covering most of the dynamic range of human
hearing. Other formats, such as DVD and Blu-Ray support 24-bit audio corresponding
to a 144 dB dynamic range and professional recording studios typically work with
32-bit resolution which is equivalent to 192 dB for maximum quality.
It should be noted that quantization introduces errors to the sampled signal
values due to rounding. However, the quantization error decreases in proportion to
the number of bits used. At 16 bits and higher, the error becomes indistinguishable
to the human ear in most applications.
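As a rough illustration of Eq. (5) and of the rounding step, a uniform quantizer and the corresponding dynamic range could be sketched in C as follows. The function names and the assumed signal range of [−1, 1] are choices made for the example only.

    #include <math.h>
    #include <stdio.h>

    /* Round a sample in [-1, 1] to the nearest of 2^b uniform quantization levels. */
    static float quantize(float x, int b)
    {
        float half_levels = (float)(1 << (b - 1));
        return roundf(x * half_levels) / half_levels;
    }

    /* Dynamic resolution in decibels, DdB = 20*log10(2^b), cf. Eq. (5). */
    static double dynamic_range_db(int b)
    {
        return 20.0 * log10(pow(2.0, b));
    }

    int main(void)
    {
        printf("16-bit: %.1f dB\n", dynamic_range_db(16));   /* about 96 dB  */
        printf("24-bit: %.1f dB\n", dynamic_range_db(24));   /* about 144 dB */
        printf("0.3 at 3 bits -> %f\n", quantize(0.3f, 3));  /* rounds to 0.25 */
        return 0;
    }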
2.2 The discrete-time Fourier transform
We have now described digital signals and how they are represented as time domain
sequences. However, many applications benefit from representing signals in the
frequency domain instead. Transforming a signal from the time domain to the
frequency domain representation is carried out using the Fourier transform. In
this section, we describe the general Fourier transform for discrete-time signals, the
discrete Fourier transform for finite discrete sequences, and the fast Fourier transform
algorithm that computes the discrete Fourier transform efficiently in software.
The discrete-time Fourier transform (DTFT) is a representation of a discrete-time
sequence x[n] in terms of the complex exponential sequence e−jωn , where ω is the
normalized frequency variable [11, p. 118]. The discrete-time Fourier transform X(ω)
of a sequence x[n] is defined by the equation
X(ω) = Σ_{n=−∞}^{∞} x[n] e^{−jωn}    (6)
and the inverse transform is
x[n] = (1/(2π)) ∫_{−π}^{π} X(ω) e^{jωn} dω    (7)
The DTFT is defined for all the values of ω. A key difference between the DTFT
and the Fourier transform for continuous time signals is that the DTFT is a periodic
function of the frequency variable ω [13, p. 13]. For integer values of m, Eq. (6) can
be written as
X(ω) = X(ω + 2πm) = Σ_{n=−∞}^{∞} x[n] e^{−j(ω+2πm)n}    (8)
This is due to the periodicity of the unit circle. The same phenomenon causes the
periodicity and aliasing in conjunction with sampling as discussed in Section 2.1.1.
Consequently, the Fourier transform of a discrete-time signal is a periodic function
of frequency.
2.2.1 The discrete Fourier transform
In the case of a finite-length sequence x[n], where n = 0, 1, ..., (N − 1), the discrete-time Fourier transform can be expressed in a simpler way. A sequence of length N only
requires N values of X(ω) at distinct frequency points ωk , where k = 0, 1, ..., (N − 1),
to determine the frequency content of x[n]. This special case of the DTFT is known
as the discrete Fourier transform (DFT). The DFT can be expressed by uniformly
sampling X(ω) between 0 ≤ ω ≤ 2π at ωk = 2πk/N . This way the DFT is obtained
by
X[k] = Σ_{n=0}^{N−1} x[n] e^{−j2πkn/N}    (9)
X[k] is also a sequence of length N and is referred to as the DFT of the sequence
x[n]. In order to make the equation more compact, the exponential is commonly
written as
WN = e−j2π/N
(10)
and Eq. (9) can be rewritten as
X[k] = Σ_{n=0}^{N−1} x[n] W_N^{kn}    (11)

The inverse discrete Fourier transform (IDFT) is given by the equation
x[n] = Σ_{k=0}^{N−1} X[k] W_N^{−kn}    (12)
The main difference between the DFT and the DTFT is that the frequency and time
variables of DFT are both discrete, whereas the DTFT has a continuous frequency
variable and a discrete time variable. Additionally, time-shifting the sequence X[k]
must be implemented circularly due to the finite-length definition of the DFT. This
property is further discussed in Section 2.3.3. Otherwise, the properties of the two
transforms are similar. The next subsection describes the fast Fourier transform, an
algorithm that efficiently computes the DFT in software applications.
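Before moving to the FFT, the direct evaluation of Eq. (9) can be written down in a few lines of C. This O(N^2) sketch is only meant to illustrate the definition; the function name is arbitrary.

    #include <complex.h>

    /* Direct DFT of Eq. (9): X[k] = sum over n of x[n] * e^(-j*2*pi*k*n/N). */
    static void dft(const float *x, float complex *X, int N)
    {
        const float pi = 3.14159265358979f;
        for (int k = 0; k < N; k++) {
            float complex acc = 0.0f;
            for (int n = 0; n < N; n++)
                acc += x[n] * cexpf(-I * 2.0f * pi * (float)k * (float)n / (float)N);
            X[k] = acc;
        }
    }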
2.2.2 The fast Fourier transform
The fast Fourier transform (FFT) is an algorithm that provides a numerically efficient
way to calculate the DFT. The algorithm was first proposed by Cooley and Tukey in 1965
[14], revolutionizing signal processing on digital computers. The FFT is a divide
and conquer algorithm that breaks a large DFT into a combination of multiple small
DFTs by exploiting the computational redundancies of the DFT. Examining Eq. (11),
it can be seen that the opposite values of k and n yield the same result. For example,
assigning k = 1 and n = 2 yields the same result W_N^{kn} = e^{−j2πkn/N} = e^{−j4π/N} as
assigning k = 2 and n = 1.
The following two properties of WN enable the deconstruction of a large DFT
into smaller DFTs [15]. First,
W_N^2 = (e^{−j2π/N})^2 = e^{−j2π·2/N} = e^{−j2π/(N/2)} = W_{N/2}    (13)
Second,
W_N^{(k+N/2)} = W_N^k W_N^{N/2} = W_N^k e^{−j(2π/N)(N/2)} = W_N^k e^{−jπ} = −W_N^k    (14)
In order to exploit these properties, the data sequence x[n] is divided into two
equal-length sequences, the even-numbered data x_{2n} and the odd-numbered data x_{2n+1}.
To divide a signal into two equal-length sequences, the signal itself must have an even number of samples. If this is not the case, the signal should be padded with a zero. This
allows the DFT, X[k], to be written in terms of two DFTs, the even-numbered and
the odd-numbered DFT. Thus, we can write Eq. (11) as
X[k] = Σ_{n=0}^{N/2−1} x_{2n} W_N^{2nk} + Σ_{n=0}^{N/2−1} x_{2n+1} W_N^{(2n+1)k}
     = Σ_{n=0}^{N/2−1} x_{2n} W_N^{2nk} + W_N^k Σ_{n=0}^{N/2−1} x_{2n+1} W_N^{2nk}    (15)
Figure 6: 8-point DFT decomposed into two 4-point DFTs.
where k = 0, 1, ..., N − 1. We can use the first property W_N^2 = W_{N/2} described in
Eq. (13) to rewrite Eq. (15) as
X[k] = Σ_{n=0}^{N/2−1} x_{2n} W_{N/2}^{nk} + W_N^k Σ_{n=0}^{N/2−1} x_{2n+1} W_{N/2}^{nk}    (16)
and further
X[k] = Xe[k] + W_N^k Xo[k]
(17)
where Xe [k] is the even-numbered DFT and Xo [k] is the odd-numbered DFT. The
decomposition of the DFT is typically visualized using so-called “butterfly” graphs.
Figure 6 illustrates the decomposition of an 8-point DFT into two 4-point DFTs.
The arrows denote multipliers and the nodes denote a sum. In mathematical terms,
the output terms X[0], ..., X[N/2 − 1] are obtained by
X[0] = Xe[0] + W_8^0 Xo[0] = Xe[0] + Xo[0]
X[1] = Xe[1] + W_8^1 Xo[1] = Xe[1] + e^{−jπ/4} Xo[1]
X[2] = Xe[2] + W_8^2 Xo[2] = Xe[2] − j Xo[2]
X[3] = Xe[3] + W_8^3 Xo[3] = Xe[3] + e^{−j3π/4} Xo[3]    (18)
and using the property W_N^{(k+N/2)} = −W_N^k described in Eq. (14), the terms X[N/2], ..., X[N − 1] are obtained by

X[4] = Xe[0] − W_8^0 Xo[0] = Xe[0] − Xo[0]
X[5] = Xe[1] − W_8^1 Xo[1] = Xe[1] − e^{−jπ/4} Xo[1]
X[6] = Xe[2] − W_8^2 Xo[2] = Xe[2] + j Xo[2]
X[7] = Xe[3] − W_8^3 Xo[3] = Xe[3] − e^{−j3π/4} Xo[3]    (19)
It can be seen that the second half of the output sequence is equal to the first, except
that the odd-numbered terms have an opposite sign. This is another property that
FFT algorithms are able to exploit to optimize their performance.
Decomposing an N-point DFT into the sum of two N/2-point DFTs effectively decreases the number of multiplications from N^2 to N^2/2. Additional N multiplications
are also required to calculate Eq. (17). An N-point DFT can be decomposed further
by recursively repeating the process until N/2 2-point DFTs remain. Sequences that
have a length of N = 2^p divide by two up to p times. The number of multiplications
M required by the FFT algorithm is given by
M = N^2/2^p + pN = N + N log2 N    (20)
The first term indicates the number of multiplications required by the split DFTs
and the second term indicates the number of multiplications required in order to
calculate Eq. (17) in each of the p recursions. When N grows large, the term N log2 N
dominates, giving a computational complexity of O(N log2 N). Using power-of-two-length sequences when executing FFT algorithms is recommended as they divide
evenly for the maximum number of times.
Understanding the FFT algorithm is not easy and implementing FFT optimally
in software requires knowledge of computer architecture. Luckily, plenty of open
source FFT libraries have been developed for programmers to easily integrate the
FFT algorithm in their DSP software projects. We further discuss FFT libraries in
Section 3.3. The next subsection discusses discrete-time convolution and the related
concepts.
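The decimation-in-time idea of Eqs. (16)-(19) can be sketched as a short recursive C routine. This is only a didactic sketch assuming a power-of-two length N; production code would use an optimized library such as FFTW instead.

    #include <complex.h>

    /* Recursive radix-2 decimation-in-time FFT. x is read with the given stride;
       out receives the N-point transform. N must be a power of two. */
    static void fft_rec(const float complex *x, float complex *out, int N, int stride)
    {
        if (N == 1) {
            out[0] = x[0];
            return;
        }
        fft_rec(x, out, N / 2, 2 * stride);                   /* even samples -> Xe */
        fft_rec(x + stride, out + N / 2, N / 2, 2 * stride);  /* odd samples  -> Xo */

        const float pi = 3.14159265358979f;
        for (int k = 0; k < N / 2; k++) {
            float complex e = out[k];
            float complex o = out[k + N / 2];
            float complex w = cexpf(-I * 2.0f * pi * (float)k / (float)N); /* W_N^k */
            out[k]         = e + w * o;                       /* Eq. (18) */
            out[k + N / 2] = e - w * o;                       /* Eq. (19) */
        }
    }

    /* Usage: fft_rec(input, output, N, 1); */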
2.3 Discrete-time convolution
Convolution is one of the most fundamental operations in DSP. The convolution
operation generally describes how a system interacts with its input in order to produce
an output. Convolution can alter the input in both amplitude and time, producing a
delayed and attenuated or amplified version of the input signal.
A finite impulse response (FIR) filter is usually implemented by directly convolving
the input signal with the impulse response of the filter. The basic operation of
autocorrelation is the convolution of the signal with a time-reversed version of itself.
A recently popular application for convolution is the implementation of realistic
reverberation and spatial effects by using impulse response measurements from
acoustic spaces, such as concert halls, caves or speaker cabinets. Convolving an input
signal with a measured impulse response replicates the acoustics of the measured
space in the output signal.
In this section, we further discuss the concepts related to discrete-time convolution,
such as linear and circular convolution, as well as the relationship between discrete-time convolution and the DFT. We also describe the algorithms that calculate
convolution efficiently under the real-time constraint.
2.3.1 Linear convolution
The distinction between linear convolution and circular convolution is an important
aspect in discrete-time convolution. The differences between the two become apparent
when segmenting a long convolution into multiple short convolutions and calculating
convolution in the frequency domain.
The linear convolution of two discrete-time signals x(n) and h(n) is defined by
the following equation [13, p. 84]:
y[n] = h[n] ∗ x[n]    (21)

y[n] = Σ_{k=−∞}^{∞} h[k] x[n − k]    (22)
Equation (21) is a commonly used short version of the equation, where ∗ denotes
convolution. By definition, the signals must be defined for all values of k, from −∞
to ∞. In practice however, convolution is performed on finite length signals. Given
an input signal x[n] of length N , and an impulse response h[n] of length M , the
convolution result y[n] is of length L = N + M − 1. If x[n] is defined to be zero
outside the interval [0, N − 1], the linear convolution can be expressed as
y[n] = Σ_{k=0}^{M−1} h[k] x[n − k]    (23)
where n = 0, 1, ..., (L − 1).
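For reference, Eq. (23) translates almost directly into C. The following sketch uses illustrative names and assumes the caller provides an output buffer of L = N + M − 1 samples.

    /* Direct time-domain linear convolution, cf. Eq. (23).
       x has N samples, h has M samples, y must hold N + M - 1 samples. */
    static void conv_linear(const float *x, int N, const float *h, int M, float *y)
    {
        for (int n = 0; n < N + M - 1; n++) {
            float acc = 0.0f;
            for (int k = 0; k < M; k++) {
                int i = n - k;
                if (i >= 0 && i < N)          /* x[n-k] is zero outside [0, N-1] */
                    acc += h[k] * x[i];
            }
            y[n] = acc;
        }
    }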
2.3.2 Circular convolution
Unlike linear convolution, circular convolution is only defined within the boundaries
of the input signal x[n]. The index of x is evaluated modulo N [13, p. 88], e.g. larger
indices than N − 1 wrap around to the beginning of the input sequence. Circular
convolution is widely used in real-time DSP as it can be efficiently implemented with
circular buffers.
Given two sequences x[n] of length N and h[n] of length M , the circular convolution
is defined by the equation
y[n] = h[n] ~ x[n]    (24)

y[n] = Σ_{k=0}^{M−1} h[k] x⟨n − k⟩N    (25)
Equation (24) is again the short version of the formula, where ~ denotes circular
convolution. The symbol ⟨n − k⟩N represents the residue of (n − k) mod N, and
n = 0, 1, ..., (N − 1).
In contrast to linear convolution, circular convolution results in a length N output
signal. This causes the output to wrap around itself which is undesirable in audio
applications. Thus, additional array processing techniques, such as zero padding and
block overlapping are required. Circular convolution also maps to convolution in the
frequency domain which we discuss next.
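A corresponding sketch of Eq. (25), with the input index wrapped modulo N, could look as follows; again the names are illustrative and M is assumed to be at most N.

    /* Circular convolution, cf. Eq. (25): the index of x is taken modulo N. */
    static void conv_circular(const float *x, int N, const float *h, int M, float *y)
    {
        for (int n = 0; n < N; n++) {
            float acc = 0.0f;
            for (int k = 0; k < M; k++) {
                int i = (n - k) % N;
                if (i < 0)
                    i += N;                   /* wrap negative indices to the end */
                acc += h[k] * x[i];
            }
            y[n] = acc;
        }
    }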
2.3.3 Fast convolution
Fast convolution is generally used to describe frequency domain convolution. The
word "fast" comes from the fact that frequency domain convolution is in most cases
faster to compute than time domain convolution. This is due to the property of the
DFT that maps circular convolution in the time domain to multiplication in the
frequency domain [11, p. 141]. This transform property is expressed by
y[n] = h[n] ~ x[n] ⇔ Y [k] = H[k]X[k]
(26)
However, in the case of linear convolution, the DFT of y is not equal to the product
of the DFTs of x and h. Since DFT is a representation of frequency samples it
corresponds to a periodic time signal. Additionally, it is obvious that the product of
two sequences of length N yields an output sequence of length N instead of 2N − 1.
Therefore, it is clear that the product of the DFTs corresponds to cyclic convolution
instead of linear convolution. [13, p. 86]
Instead, linear convolution is equivalent to multiplication in z-domain according
to the convolution property of the z-transform. The z-transform property of linear
convolution is expressed as
y[n] = h[n] ∗ x[n] ⇔ Y (z) = H(z)X(z)
(27)
It is important to distinguish between the z-transform and the DFT, as well as the
notations ∗ and ~ that represent linear and circular convolution. In this thesis, we
focus on circular convolution as it can be efficiently implemented in the frequency
domain using the FFT.
Fast convolution is useful when the convolved signals are long. This becomes
evident by comparing the number of multiplications required by the time domain
linear convolution and fast convolution. It can be seen from Eq. (22) that obtaining
Table 1: The number of real multiplications required for the convolution of two
length-N sequences. Adapted from [15, p. 224]

N       Direct method    Fast convolution
8       64               448
16      256              1 088
32      1 024            2 560
64      4 096            5 888
128     16 384           13 312
256     65 536           29 696
512     262 144          65 536
1024    1 048 576        143 360
2048    4 194 304        311 296
the linear convolution of two length-N sequences h[k] and x[n − k] it is necessary to
multiply each value of h[k] by each value of x[n − k]. Therefore, convolution in the
time domain requires N^2 multiplications.
Fast convolution of the same length-N sequences requires a bit of additional
work. To avoid the wrapping effect of the circular convolution, we must augment the
sequences with zeros so that they are at least 2N − 1 samples long. For simplicity,
we assume 2N − 1 ≈ 2N . As discussed in Section 2.2.2, an N -point FFT performs
(N/2) log2 N complex multiplications. Thus, for a 2N -point FFT N log2 2N complex
multiplications are necessary. Fast convolution requires a total of two DFTs and one
IDFT to be computed as both input signals must be transformed and the output
signal must be inverse transformed. Therefore, the number of complex multiplications
becomes 3N log2 2N . Furthermore, the actual convolution operation requires the
evaluation of the complex multiplication X[k]H[k] as shown in Eq. (26) increasing
the number of complex multiplications to 3N log2 2N + 2N. Finally, each complex
multiplication is of the form
Z = (A + jB)(C + jD) = AC − BD + j(AD + BC)
(28)
which shows that each complex multiplication consists of four real multiplications. This leads
to the actual number of real multiplications required by fast convolution which is
12N log2 2N + 8N .
We conclude that time domain linear convolution requires N^2 real multiplications
while fast convolution requires 12N log2 2N + 8N of them. Table 1 compares the
amount of real multiplications necessary for each method at different values of N .
Table 1 shows that fast convolution is faster than direct time domain convolution
for sequences of length 128 or longer. It is evident that the quadratic growth of
the number of multiplications makes time domain convolution infeasible when convolving long
signals. However, when the convolved signals are shorter than 128, the overhead for
computing FFTs and complex arithmetic results in a longer computation time than
using direct time domain convolution.
It should be noted that even though the number of multiplications is a good
theoretical measure of the computational complexity of a DSP algorithm, it does
not directly describe how many CPU cycles an algorithm requires in practice. The
overall performance of an algorithm also depends on other arithmetic operations, as
well as reading and writing data from/into registers and memory.
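The operation counts discussed above are easy to reproduce; the short C program below simply evaluates N^2 and 12N log2 2N + 8N for the values of N used in Table 1.

    #include <math.h>
    #include <stdio.h>

    /* Compare real multiplications: direct convolution (N^2) versus
       fast convolution (12*N*log2(2N) + 8*N), as in Table 1. */
    int main(void)
    {
        printf("%6s %12s %12s\n", "N", "Direct", "Fast");
        for (int N = 8; N <= 2048; N *= 2) {
            double direct = (double)N * (double)N;
            double fast = 12.0 * N * log2(2.0 * N) + 8.0 * N;
            printf("%6d %12.0f %12.0f\n", N, direct, fast);
        }
        return 0;
    }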
2.4 Real-time convolution methods
In this section, we describe techniques used in real-time convolution. We discuss
input segmentation and methods that implement block-wise circular convolution and
output reconstruction.
2.4.1 Signal segmentation
In real-time digital systems the input signal is considered to be of infinite length.
Thus, it is necessary to divide the input signal into finite length blocks. The input
signal x[n] can be written as a sequence of length-B blocks [13, p. 86].
xm[n] = x[n],  mB ≤ n ≤ mB + B − 1
        0,     otherwise                    (29)
where m is an integer that denotes the block index. Substituting Eq. (29) into
Eq. (21) gives
y[n] = h[n] ∗ Σ_{m=−∞}^{∞} xm[n]    (30)
The output signal can also be expressed as a sequence of blocks by writing Eq. (30)
as
ym [n] = h[n] ∗ xm [n]
(31)
In practice, the segmentation is typically carried out by the audio interface which
buffers the input samples from the ADC. A DSP application is then able to read
this buffer of samples, conduct processing, and write the processed samples to the
output buffer of the audio interface.
In the case where the impulse response h[n] is longer than the length of the input
block B it is practical to also divide h[n] into blocks of length B. Given an impulse
response of arbitrary length M , the number of length-B blocks is obtained by
L = M/B    (32)
Using Eq. (29) we can express the segmented impulse response hq [n] as
hq[n] = h[n],  qB ≤ n ≤ qB + B − 1
        0,     otherwise                    (33)
where q = 0, 1, ..., (L − 1). Consequently, substituting Eq. (33) into Eq. (31) we can
express a block-wise convolution of an infinite sequence xm [n] and a length-L impulse
response hq [n] by the following equation
ym[n] = Σ_{q=0}^{L−1} hq[n] ∗ xm−q[n]    (34)
The reconstructed output signal y[n] is then given by
y[n] = Σ_{m=−∞}^{∞} Σ_{q=0}^{L−1} hq[n] ∗ xm−q[n]    (35)
Dividing h[n] into blocks of the same length as x[n] requires the minimum amount of
processing per sample buffer. Blocks longer than B cause unnecessary pre-emptive
processing because the output buffer only requires B samples at a time. Additionally, B is the minimum length that enables the processing of each input block
with one convolution while still reconstructing the output signal correctly.
Next we describe two techniques that implement input segmentation, convolution
and output reconstruction. The techniques are referred to as overlap-add and
overlap-save convolution. These techniques are essential in real-time convolution
applications.
2.4.2 Overlap-add method
The first of the two block-wise convolution techniques is the overlap-add method.
The method takes an input signal xm [n] which is divided into length-B blocks as
described in Section 2.4.1. We also assume a long impulse response hq [n] that is
divided into L length-B blocks.
The problem that arises when implementing block-wise convolution is that while
the output buffer requests B samples, the linear convolution of two sequences of
length B results in an output ym [n] of length 2B −1 samples. The overlap-add method
approaches this problem by writing the first B samples ym1 = ym [0], ym [1], ..., ym [B−1]
into the output buffer and storing the other B − 1 samples ym2 = ym[B], ym[B +
1], ..., ym[2B − 2] into memory. The stored sequence ym2 is the overlapping portion of
the convolution result that is added to the output portion of the next convolution.
Figure 7 visualizes the overlap-add method. In this simplified case the input
signal consists of two blocks x1 [n] and x2 [n] which are convolved sequentially with
the impulse response h[n]. The blocks and the impulse response are of equal length.
Figure 7: Overlap-add convolution.
The correct convolution result is obtained by adding the overlapping portion of y1 [n]
to y2 [n].
However, as discussed in Sections 2.3.2 and 2.3.3, convolution in software is often
implemented as either circular convolution or fast convolution. In order to obtain the
correct output while using circular convolution, the input blocks must be augmented
with zeros, i.e. a minimum number of B − 1 zeros is inserted before the input block.
Furthermore, fast convolution requires the input and the impulse response to be of
equal lengths in order to perform the complex multiplication. Thus, the impulse
response is padded with a minimum of B − 1 zeros as well.
In practice, however, due to power-of-two audio buffer sizes, and as FFT algorithms
perform optimally with power-of-two block sizes, the augmentation is implemented
with B instead of B − 1 samples. This way, the augmented blocks are guaranteed to
be a power of two in size given that the audio input/output buffer size is a power of
two.
Figure 8 visualizes the overlap-add method in conjunction with circular convolution
and the augmented zeros in each input and impulse response block. The graph shows
three iterations of overlap-add convolution. The impulse response is divided into
three blocks h0 , h1 and h2 . The current input block is always convolved with the first
impulse response block h0 and the two previous input blocks are convolved with h1
and h2, respectively. The input blocks are stored in a circular buffer, a data structure
that overwrites the oldest value in the buffer with the most recent one. The figure also
shows that after the summation of the convolution results only the second half of
the output block ym is written to the output buffer and the first half of the output
sm is stored and added to the second half of the next output block ym+1 .
Figure 8: Overlap-add convolution with segmented hq .
Implementing the overlap-add method with fast convolution is otherwise as
described in Fig. 8, except the convolutions are of the form
ym = Σ_{q=0}^{L−1} IFFT{FFT{xm−q} FFT{hq}}    (36)
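The bookkeeping behind Figs. 7 and 8 can be illustrated with a time-domain sketch for a single length-B impulse response block: the first half of each block convolution goes to the output and the tail is saved and added to the next block. The fixed block size and the static overlap buffer are simplifications for this example, not the structure of the external described in Chapter 4.

    /* Time-domain overlap-add for one length-B impulse response block. */
    #define B 256

    static float overlap[B];                   /* saved tail of the previous block */

    static void overlap_add_block(const float *x, const float *h, float *out)
    {
        float y[2 * B] = { 0.0f };             /* linear convolution result (2B-1 samples) */
        for (int n = 0; n < B; n++)
            for (int k = 0; k < B; k++)
                y[n + k] += h[k] * x[n];
        for (int n = 0; n < B; n++) {
            out[n] = y[n] + overlap[n];        /* first half plus the stored tail */
            overlap[n] = y[n + B];             /* save the new tail for the next call */
        }
    }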
2.4.3 Overlap-save method
The second technique that implements block-wise convolution is the overlap-save
method. Unlike the overlap-add technique, the overlap-save method augments the
input blocks with samples from the previous blocks instead of zeros. This way the
circular convolution produces both the overlapping segment of the previous output
and the current non-overlapping segment to the second segment of the output block
ym[B], ym[B + 1], ..., ym[2B − 1]. As the current input block will be used in the next
iteration, the overlapping segment ym [0], ym [1], ..., ym [B − 1] can be discarded. The
name overlap-save comes from this property of “saving” the second segment of the
output block and discarding the first segment.
Figure 9: Overlap-save convolution with segmented hq .
The overlap-save method is visualized in Fig. 9. The figure visualizes three
iterations of inputs x0 , x1 and x2 . The input is convolved with three impulse response
blocks h0 , h1 and h2 . The figure shows that each input block xm is now preceded by
the previous input block xm−1 . The red crosses over the overlap segments illustrate
the omission of these segments.
Due to the omission of the overlap segments, the overlap-save method requires
half as many additions as the overlap-add technique. If L is the number of impulse
response blocks and B is the block length in number of samples, the overlap-add
method requires 2B(L − 1) + B additions for the summation of the convolution
outputs. In contrast, the overlap-save method only requires B(L − 1) additions.
It should be noted, however, that by definition the overlap-save method only
implements circular convolution. Conversely, the overlap-add technique also applies
with linear convolution, in which case the block augmentation is unnecessary. Nevertheless, convolution applications that use long impulse responses, possibly utilizing
fast convolution, clearly benefit from using the overlap-save method.
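The corresponding overlap-save bookkeeping, in a time-domain form, is sketched below: each block is preceded by the previous one, and only the last B samples of the segment convolution are kept. As above, the names and the static history buffer are assumptions made for the example.

    /* Time-domain overlap-save for one length-B impulse response block. */
    #define B 256

    static float prev[B];                      /* previous input block */

    static void overlap_save_block(const float *x, const float *h, float *out)
    {
        float seg[2 * B];
        for (int n = 0; n < B; n++) {
            seg[n] = prev[n];                  /* augment with the previous block */
            seg[n + B] = x[n];
        }
        /* Only samples B..2B-1 of the segment convolution are kept as output;
           the first B samples would wrap around and are discarded. */
        for (int n = B; n < 2 * B; n++) {
            float acc = 0.0f;
            for (int k = 0; k < B; k++)
                acc += h[k] * seg[n - k];
            out[n - B] = acc;
        }
        for (int n = 0; n < B; n++)
            prev[n] = x[n];                    /* save the block for the next call */
    }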
2.5 Networking concepts
Our system uses networking for digital audio transmission. Therefore, we need to
discuss the relevant theory behind networks and network protocols in order to set
up the network correctly and decide on the right network protocols. We begin by
describing some elementary networking concepts, then move on to discuss protocol stacks
and Internet addressing.
2.5.1 Computer networks
A computer network is a set of hardware and software that enables computers and
other devices, e.g. printers and file servers, to share information with one another
through telecommunications media, e.g. telephone lines or radio [16, p. 3]. Every
piece of hardware connected to a network is commonly abstracted as a node. Each
node is identified by two types of addresses. The first is the hardware address, also
known as the MAC (Media Access Control) address, that physically identifies the
device. This address is commonly assigned by the hardware manufacturer and is
difficult to change afterwards. The second address is a software address, assigned
and used by the software that handles the data transmission.
Networks are generally classified by the distance they cover. A local area network
(LAN) is a network confined in a small geographical area, e.g. a floor, a single building
or a set of buildings in a close physical proximity. A wide area network (WAN) covers
a large geographical area from the scales of a city to multiple countries. A WAN often
uses telecommunications media leased from commercial network providers. Internet
in lower case letters (internet) describes a WAN that connects multiple networks into
a larger network. When written with a leading capital letter (Internet), it denotes
the global network that supports the World Wide Web. Our DSP system is typically
confined in a small area, for example, a listening room or a stage. Therefore, we
discuss networking from the perspective of LANs only.
In communications, data messages are commonly wrapped inside network packets.
The term packet is used to describe a general unit of data carried over a network.
A packet consists of a sequence of user data and control information. Control
information is a set of headers and trailers that provide information on how to
deliver the data, e.g. source and destination addresses, error correction methods and
sequencing information.
2.5.2 Connection-oriented and connectionless communication
Data communications can be divided into two categories, connection-oriented and
connectionless communication [17, p. 7]. In connection-oriented data transfer a
connection is first established between a transmitter and a receiver. Additionally, the
transmitter verifies that the messages have arrived correctly at the receiver. Most
network file transfer applications, for example, use connection-oriented communications.
In connectionless data transfer, no connection is established and the transmitter
does not know whether the sent data arrives correctly or not. Most media streaming
applications, for example, use connectionless communication, as they constantly send
high amounts of data and individual corrupted or lost packets have little consequences.
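As a minimal illustration of connectionless transfer, the following C sketch sends a single UDP datagram with the POSIX sockets API: no connection is established and no acknowledgement is expected. The address and port are placeholder values.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);   /* UDP socket */

        struct sockaddr_in dest;
        memset(&dest, 0, sizeof(dest));
        dest.sin_family = AF_INET;
        dest.sin_port = htons(9999);                 /* placeholder port */
        inet_pton(AF_INET, "192.168.1.10", &dest.sin_addr);

        const char msg[] = "audio block";
        /* The datagram is handed to the network; delivery is not confirmed. */
        sendto(sock, msg, sizeof(msg), 0, (struct sockaddr *)&dest, sizeof(dest));

        close(sock);
        return 0;
    }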
2.5.3 Network protocols
The procedures used when transferring data over a network are known as network
protocols [16, p. 9]. The protocols provide a standardized way for computers to
format and transfer data. Without an open standardized way of networking, hardware
and networks from different manufacturers would not be compatible with each other.
Protocols are often defined in groups referred to as protocol stacks. In the next
section we discuss a layered communications model which is a standard blueprint for
most protocol stacks in use today.
2.6 OSI Model
Most standard protocol stacks in use today are designed according to the open systems
interconnection (OSI) model (ISO/IEC 7498-1) [18]. The model was originally defined
in 1978 by the ISO (International Organization for Standardization) to create standardized
open guidelines for networking [19, p. 21]. OSI defines a data communications
management structure that breaks data communications down into a hierarchy of
seven layers. The layers are based on function, which simplifies the network model,
and allows easier implementation and better consistency [17, p. 4]. The layers are
briefly described as follows.
Application The top-most layer of the OSI model is responsible for giving user
applications access to the network [19, p. 25]. Examples of application-level
tasks include file transfer and e-mail services.
Presentation The presentation layer is responsible for converting information into a
suitable format for applications and users [19, p. 26]. Examples of presentation
layer services include conversion of special character sets, data compression or
expansion, and data encryption or decryption.
Session The purpose of the session layer is to organize the communication for
multiple communication sessions that take place at the same time [19, p. 26].
The layer also ensures that the proper security measures are taken during a
connection.
Transport The transport layer ensures data transfer at a specified level of quality,
such as certain transmission speeds and error rates [19, p. 26]. The level of
quality depends on whether connection-oriented or connectionless transfer is
used. Additionally, the layer organizes packets into correct sequence according
to packet numbers.
Network The network layer interfaces with other networks by determining addresses
or translating from hardware to network addresses [19, p. 27]. It also ensures
that the size of the packets is compatible with the receiving network by fragmenting large packets if needed. Moreover, it routes the packets through the
fastest possible path from source to destination [17, p. 10].
Data Link The data link is the first software layer and provides the means to transfer
and receive data packets over the physical layer. It also provides detection for
errors caused by physical phenomena, e.g. electromagnetic interference and
temperature, in the communication channel [17, p. 10].
Figure 10: Encapsulation of the data in the OSI layer model.
Physical The physical layer is the lowest layer and the only hardware layer in
the OSI model. It takes the data packets created by the data-link layer and
converts them into electrical signals that represent the values 0 and 1 in digital
transmission [19, p. 28]. The layer also converts received electrical signals into
a series of bits and groups them into packets for the data-link layer.
The layer concept is further visualized in Fig. 10. Each layer contains a separate
protocol that interfaces with the layers above and below. When a transmitting device
sends a message, it traverses the protocol stack top-down. The user application
stands at the top ultimately defining the format of the information to be sent. Each
layer then gets information from the upper layer and formats the network packet
accordingly. In addition, the lower three software layers insert header fields (the
Data Link inserts both a header and a trailer) to the network package. The headers
specify how to unpack and present the information in the receiving device. This
procedure is referred to as encapsulation.
The overall communication data flow is illustrated in Fig. 11. When a device
receives data, the layers work with the network packets in opposite order. Headers
are extracted and each layer receives information from the lower layer. The extraction
of headers is referred to as decapsulation respectively.
In the end, the OSI model is only a standardized reference for implementing
protocol stacks. Several different protocol stacks are in use and their layering structure
and detailed functionality might differ from the OSI model. However, most modern
protocol stacks can generally be mapped to the OSI model. In the next section, we
discuss a protocol stack, which most modern networks, including the Internet, are
based on.
Figure 11: The data flow concept in the OSI model.
2.7
TCP/IP protocol stack
The TCP/IP (Transmission Control Protocol/Internet Protocol) protocol stack was
originally based on a project known as ARPANET (Advanced Research Projects
Agency Network) which was initiated in 1969 by the US Department of Defense [17,
p. 19]. The aim of the project was to create a wide-area communication system that
connects heterogeneous hardware and software systems across the United States.
Initially the network connected only government and university facilities, but as time
progressed, commercial companies were allowed access. In 1983 the TCP/IP model
was fully adopted, and ARPANET gradually came to be referred to as the Internet.
Unlike the OSI model that was designed as a reference for implementing protocols,
the TCP/IP stack was developed based on existing protocols [17, p. 20]. The
name of the protocol stack derives from its two core protocols, TCP and
IP, which are discussed later in this section. Even though the OSI and
TCP/IP models were developed from different perspectives in different times, they
both serve as frameworks for communication infrastructure. Despite the success
of the TCP/IP stack, the OSI model still provides an educational insight on how
networks inter-operate.
Whereas the OSI model consists of seven layers, the TCP/IP stack only consists
of four layers called application layer, transport layer, Internet layer and network
access layer. Despite having fewer layers, the TCP/IP stack maps well to the OSI
model.
Figure 12: OSI and TCP/IP models.
Figure 12 illustrates the contrast between the OSI and TCP/IP layers. The
application layer includes the functionalities of all three upper layers of the OSI
model. The transport and the Internet layers correspond to the respective OSI layers.
The network access level encompasses the functionality of both the Data Link and
the Physical layers.
2.7.1
Application layer
The application layer provides the user or application programs with interfaces to the
transport layer [19, p. 77]. Additionally, the layer handles data representation and
encoding incorporating the functionality of the session and presentation layers of the
OSI model. The application layer contains a wide range of high-level protocols, some
of the common ones being the Hyper Text Transfer Protocol (HTTP), the Simple
Mail Transfer Protocol (SMTP) and the File Transfer Protocol (FTP).
2.7.2
Transport layer
The transport layer serves the same overall functionality as described in the OSI
model. The layer ensures data transfer at a specified quality and ensures data
integrity, e.g. sequencing and error handling, between the sender and the receiver. The
layer implements two protocols, Transmission Control Protocol (TCP) and User
Datagram Protocol (UDP).
TCP, as described in RFC 793 [20], is a connection-oriented and reliable host-to-host protocol. The protocol creates a connection between the source and destination
devices, breaks up the data into segments at the source and reassembles them at the
destination. Note that the term segment is used to denote network-packets at this
layer. Segments that are damaged, lost, duplicated, or delivered out of order by the
network, are recoverable by TCP.
Figure 13: TCP header.
Figure 14: UDP header.
TCP inserts a checksum field in the TCP header to verify that a segment is not
damaged. At the destination, the protocol calculates a checksum number for the
segment and compares it with the checksum in the header. If the two checksums
mismatch, the data contains errors and a retransmission is requested. In order to
recover lost segments, the sending TCP requires a positive acknowledgement from
the receiving TCP. If the acknowledgement is not received within a set time interval,
the sending TCP retransmits the segment. In addition, the protocol uses sequence
numbers to correctly order segments that may be received out of order. This also
eliminates duplicate segments.
These verification methods require an extensive header to be attached to each
data segment. Figure 13 illustrates the TCP header. The header consists of a total
of 12 fields, including the source and destination port numbers, the fields described
above used for packet recovery, as well as other fields used for connection-related
parameters. The numbers at the top denote the length of the fields in bits. Further
details of the fields can be found in the literature [17, 19].
Unlike TCP, UDP is a connectionless protocol offering a procedure for application
programs to send and receive data with a minimal protocol mechanism [21]. The
protocol is unreliable, i.e. delivery, duplication and ordered sequencing of packets are
not guaranteed. However, without the various connection maintenance mechanisms
of TCP, UDP enables faster transmission times for applications. The protocols differ
slightly in terminology as well. Network packets transferred using UDP are commonly
referred to as datagrams, whereas packets sent using TCP are called segments.
Figure 14 shows the UDP header. The fields include the source and destination
port numbers, of which the source port is optional. The length field indicates the
length of the datagram in bytes including the header and the data. The checksum
is similar to the one used in TCP. However, it only protects against misrouted
datagrams by discarding the ones with mismatching checksums.
Figure 15: IP header.
As can be seen, the UDP header is 12 bytes smaller than the TCP header without
options. This can save a significant amount of bandwidth when a large number
of packets are transferred. The minimal overhead combined with the fast protocol
mechanism makes UDP a good choice in broadcast and time-critical applications,
e.g. video streaming and Voice over IP (VoIP).
2.7.3
Internet layer
The Internet layer defines an addressing scheme for packets, selects the best path for
the data to travel from source to destination, as well as fragments and reassembles
packets when necessary [17, p. 22]. The procedure of finding the optimal path for
network packets is also known as routing. The main protocol operating in this layer is
IP (Internet Protocol) [22]. IP inserts a header with address and control information
to outgoing packets. The structure of an IP header is shown in Fig. 15.
The TTL (time to live) field defines how many routing nodes the packet is able to
traverse before it is discarded. This ensures that lost packets do not keep wandering
in the network looking for their destination.
The header checksum provides a verification that the header information used in
processing the packet has been transmitted correctly. Unlike TCP, the IP checksum
does not verify the correctness of the data. If the checksum fails the packet is
discarded.
The identification, the flags and the fragmentation offset fields are used in the
fragmentation and reassembly of data packets. Packet fragmentation is needed if
large packets allowed by the local network are sent to another network which limits
packets to a smaller size. The receiver uses the identification field to ensure that
fragments from different packets are not mixed. The fragment offset tells the receiver
the position of the fragment in the original packet. The flags field indicates whether
more fragments follow.
The IP address fields are used for identifying the source and destination of packets,
as well as routing. IP addresses are software addresses that can be flexibly changed
when required. We discuss IP addressing in more detail in Section 2.8.
Figure 16: TCP/IP protocols.
2.7.4
Network access layer
The purpose of the network access layer is to define the procedures used to interface
with the network hardware and access the physical transmission medium [17, p. 20].
The layer does not define its own new standard, but rather uses existing LAN
and WAN standards, such as Ethernet, Wireless LAN (WLAN) and Frame Relay.
Therefore, most of the work is carried out by the network device drivers.
We have now briefly discussed the four layers of the TCP/IP protocol stack.
Figure 16 shows an overview of the protocol stack and lists some of the commonly
used protocols.
2.8
IP addressing
IP addressing conventions are important when setting up networks, as administrators
often need to configure them manually. IP version 4 (IPv4) defines IP addresses as
32-bit data types commonly represented as four 8-bit unsigned integers separated by
a dot, e.g. 192.168.1.255, for human readability. A more recently implemented IP
version 6 (IPv6) uses 128-bit addresses. However, the thesis focuses only on IPv4
because it is still widely in use today and is simpler in design.
2.8.1
IP address classes
The IPv4 addresses were originally divided into network and address parts [19, p. 81].
The first one, two or three bytes of the address identify an entire network and the
rest of the bytes identify a node. The number of bytes used for the network part
defines the class of the network. The class specifies the number of unique networks
allowed and the number of nodes supported per network.
The three main class types are Class A, consisting of a small number of networks
with a large number of nodes, Class B, which holds a moderate number of both networks
and nodes, and Class C, which consists of a large number of networks with a small
number of nodes. Table 2 shows the address ranges and the number of networks and
nodes per network supported by each of these network classes.

Table 2: Attributes of network classes A, B and C.

Class   Address range                  Networks    Nodes        Network bytes
A       1.0.0.0 – 127.255.255.255      126         16 777 216   1
B       128.0.0.0 – 191.255.255.255    16 384      65 534       2
C       192.0.0.0 – 223.255.255.255    2 097 152   254          3
However, IPv4 reserves two addresses in class A for special purposes. The address
0.0.0.0 cannot be assigned to a network, as it is reserved for broadcasting to all nodes
on the current network. The other special address is 127.0.0.1, which nodes use as a
loopback address for communicating with themselves.
The number of network bytes denotes the number of bytes used for the network
portion of an IP address. Every IP address in the same network contains the same
network bytes, e.g. nodes 192.168.100.1, 192.168.100.120 and 192.168.100.255 share
the same network 192.168.100.0.
To compare the network portions of two IP addresses, IP uses a technique called
netmasking. IP constructs a bitmask that contains as many 1s as the network portion
is long and the rest are 0s. The protocol then extracts the network address portion
from the IP address using a simple bitwise AND operation on the address and the
bitmask. For example, a bitmask 255.255.255.0 would be used for a class C address
192.168.100.127. The protocol would identify the network 192.168.100.0.
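As an illustration, the comparison reduces to one bitwise AND per address. The
following C sketch assumes the standard inet_addr() conversion function; the
function name same_network is only illustrative.

#include <arpa/inet.h>

/* Return 1 if the two IPv4 addresses share the same network under the
   given netmask, e.g. same_network("192.168.100.127", "192.168.100.1",
   "255.255.255.0") returns 1. */
int same_network(const char *ip_a, const char *ip_b, const char *mask)
{
    in_addr_t a = inet_addr(ip_a);
    in_addr_t b = inet_addr(ip_b);
    in_addr_t m = inet_addr(mask);
    return (a & m) == (b & m); /* the AND extracts the network portion */
}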
2.8.2
Internet-unique and private IP addresses
If a network will never be connected to the Internet, nodes can use essentially any
IP addresses as long as the above-mentioned rules are followed. A general
convention is to use class C addresses for clarity, as administrators can assign unique
class C network numbers to different LAN segments. Additionally, each segment has
room for 254 nodes. However, if a network is eventually connected to the Internet,
the manually assigned IP addresses might cause address conflicts.
There are two methods to prevent address conflicts. The first is to request an
Internet-unique IP address from an Internet Service Provider (ISP). The ISP allocates
IP addresses automatically, and the addresses are not used anywhere else on the
Internet. However, ISPs are commercial Internet access providers and charge a
fee for this service.
Table 3: Private network classes.

Class   Address range                    Netmask        Mask bits
A       10.0.0.0 – 10.255.255.255        255.0.0.0      8
B       172.16.0.0 – 172.31.255.255      255.240.0.0    12
C       192.168.0.0 – 192.168.255.255    255.255.0.0    16
The second method is to use restricted IP addresses designated for private networks.
The standard RFC 1918 [23] defines an address space for each network class that is
reserved for private networking. Table 3 shows the address ranges according to class.
Nodes using these addresses cannot connect outside the LAN, as routers discard
packets using reserved addresses. In order to connect outside the LAN, a gateway, a
network address translator or a proxy server is required.
2.8.3
Classless addressing
Nowadays, IPv4 addresses are mostly considered classless, and the traditional boundaries between class A, B and C addresses can be ignored. Instead, network addresses
are grouped using netmasks.
An alternative notation shows the netmask in conjunction with the IP address.
This can be achieved by including a slash and the number of 1s in the netmask at
the end of the address. For example, an address 192.168.1.12 with a netmask of
11111111 11111111 11111111 11000000, i.e. 255.255.255.192, would be written as
192.168.1.12/26.
2.8.4
Broadcast address
Additionally, a network needs to define a special broadcast address, which enables
nodes to effectively send datagrams to all other nodes in the current network. Standard
RFC 919 [24] defines a convention that uses the last IP address in the network address
range for broadcasting. For example, the broadcast address of the network 192.168.1.0
would be 192.168.1.255. Broadcasting is only available for UDP datagrams.
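For illustration, sending a datagram to such a broadcast address with the Berkeley
sockets API only requires enabling the SO_BROADCAST option on a UDP socket. The
address, port and function name below are illustrative values for this sketch, not
part of the implemented system.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch: send one UDP datagram to the broadcast address of 192.168.1.0/24. */
int send_broadcast(const void *data, size_t len)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    int on = 1;
    setsockopt(sock, SOL_SOCKET, SO_BROADCAST, &on, sizeof on);

    struct sockaddr_in dst;
    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9999);                       /* illustrative port */
    dst.sin_addr.s_addr = inet_addr("192.168.1.255"); /* broadcast address */

    int sent = (int)sendto(sock, data, len, 0,
                           (struct sockaddr *)&dst, sizeof dst);
    close(sock);
    return sent;
}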
2.9
Audio over IP
Audio over IP (AoIP) denotes the distribution of digital audio over TCP/IP networks.
In spite of being widely used in Internet radio broadcasting, media streaming and
telecommunication, AoIP is rarely utilized in audio hardware in recording studios, live
performance situations or consumer electronics [25]. Commercial audio equipment
often uses point-to-point technologies, such as the S/PDIF (Sony/Philips Digital
Interface Format) or the AES3 (Audio Engineering Society 3) [26] instead.
The transport of professional audio over a network has so far been challenging
due to the characteristics distinct to audio. Uncompressed high-fidelity audio
requires a high bandwidth compared to audio compressed with lossy coding methods.
Available bandwidth might be exceeded due to high sample resolution and sampling
rate, especially in multi-channel applications. In addition, the low-latency constraint
prevents the processing and look-ahead required for achieving data reduction, for
example, using lossless coding. At the same time, the demand for system-wide
synchronization calls for an efficient clock recovery process because sending clock
signals at sample-rate is infeasible in larger networks.
Despite the challenges, the current trends and advancements in network technologies show signs of AoIP becoming more common in both professional and consumer
audio hardware. First, the bandwidth constraints alleviate as Gigabit Ethernet
becomes more common. Second, the infrastructure for network-based audio is more
attainable as the number of broadband Internet connections at home increases and
the cost of distributing the broadband connection to multiple personal computers
(PC) and network applications decreases. Lastly, PC and consumer electronics technologies are converging as PC manufacturers are moving to compete with the living
room electronics manufacturers. Simultaneously, consumer electronics companies are
adding networking features to their products.
Currently, there are several proprietary commercial implementations of audio
over Ethernet, such as Dante, Livewire and EtherSound. Some implementations
utilize the IP protocol, but others are based directly on the data link layer. The
data link layer used is typically Ethernet, as wireless networking has proven unreliable
in professional audio applications. Additionally, customers are able to build their
system with standard Ethernet cabling and equipment.
3
Research material and methods
In this chapter, we discuss the methods used in the implementation.
3.1
Single board computers
A single board computer (SBC) generally refers to a computer with a processor,
memory and input/output (I/O) connectors on a single circuit board [27]. In the
past, SBCs differed from desktop personal computers (PC) by requiring no expansion
boards for peripheral functions e.g. audio, video and networking. Nowadays, most
PCs include basic peripherals integrated on the motherboard and expansion boards
are mainly used for high performance graphics and audio.
As integrated circuits have developed, SBCs have become smaller, cheaper and
more powerful. While their performance might not be on par with cutting-edge
PCs, SBCs have a higher performance/price ratio due to their low cost. Another
advantage SBCs have over typical desktop PCs is their extremely low power consumption. A PC can consume power from 100 W up to 1 kW depending on the
peripherals. In contrast, most SBCs consume power between 2 W and 10 W [28].
Because of these traits, SBCs are widely used in open-source embedded systems
development, educational purposes and do-it-yourself projects.
The benefits of SBCs raise interest in studying their applicability in real-time audio
signal processing tasks. Due to their portability, the computers could work as
plug-and-play audio effects for digital musical instruments, room correction equalizers
for loudspeakers or processing units for adaptive noise cancellation. Programming
DSP applications for SBCs is straightforward, as most of them support Linux-based
open-source operating systems. The community has developed numerous
C/C++ programming tools and real-time audio APIs for Linux, whereas digital
signal processors typically require programming in assembler or using a
hardware-specific C/C++ API.
However, general-purpose operating systems are not inherently designed for running real-time applications. Operating systems perform routines, such as scheduling
and I/O polling that might temporarily take CPU time away from the real-time
process. In the worst case this might cause unpredictable under-runs resulting in
noise in the audio output. To overcome this problem, we can use a specialized
real-time operating system, recompile the Linux kernel with real-time settings or
simply kill unnecessary processes and configure the operating system to prioritize
our DSP application.
Currently, a myriad of SBCs from different manufacturers can be found in the
market. In order to choose a suitable computer for our application, we need to
compare the specifications of a group of SBCs in the same price range. The relevant
hardware specifications for DSP tasks are the CPU type, clock speed, the amount
of memory and the I/O bus clock speed. We first explain why these specifications
are important from our point of view and then evaluate the specifications of four
currently popular SBCs in the sub-100 € price range.
Figure 17: Comparing SIMD parallel add with 32-bit scalar add. [29]
3.1.1
CPU considerations
The CPU is the heart of a computer and especially important when running
arithmetic-heavy DSP programs. Besides clock speed and cache layout, we are interested in
possible extensions for floating point arithmetic to speed up DSP code. Nowadays,
most processors include a single-instruction multiple-data (SIMD) extension to speed
up vector arithmetic. As the name suggests, SIMD execution performs a single
instruction on multiple data elements at once. SIMD execution utilizes large registers
that fit several small data values commonly used in image and audio processing [30].
The processor then performs an instruction on the large register instead of multiple
small registers. Fig. 17 illustrates a SIMD addition of four 32-bit floating point values
compared to a regular scalar addition. In audio and video processing, the data is
always contained in vectors and typically the same instructions are performed on
each vector element. Therefore, SIMD execution introduces data level parallelism,
effectively speeding up vector arithmetic.
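The parallel addition in Fig. 17 can be expressed directly with the NEON intrinsics
provided by GCC. The short C sketch below is illustrative only and assumes the
code is compiled with NEON support enabled (see Section 4.5.1).

#include <arm_neon.h>

/* Add four 32-bit floats with a single SIMD instruction. */
void add4(const float *a, const float *b, float *c)
{
    float32x4_t va = vld1q_f32(a);      /* load A0..A3 into a 128-bit register */
    float32x4_t vb = vld1q_f32(b);      /* load B0..B3 */
    float32x4_t vc = vaddq_f32(va, vb); /* C0..C3 = A0..A3 + B0..B3 at once */
    vst1q_f32(c, vc);                   /* store the result */
}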
3.1.2
RAM requirements
The amount of random access memory (RAM) needs to be sufficient to run
Pure Data and store two long impulse responses, as well as two segments of the input
signal in the frequency domain. The number of bytes Mt required to store audio
data in 32-bit floating point is given by
Mt = (32/8)fs Nt
(37)
where fs is the sampling frequency and Nt is the length of the array in seconds.
Due to the property of circular convolution, as discussed in Section 2.4.2, each
input and impulse response frame needs to be augmented with zeros doubling their
length before the FFT. Additionally, as the frames are stored in the frequency
domain, the real and imaginary parts need to be stored separately. Therefore, the
amount of required memory Mf is four times as much as in time domain: Mf = 4Mt .
Table 4: Specifications of four low-cost single board computers. [31, 32, 33, 34]

            BeagleBone Black        Cubieboard2          ODROID C1+             Raspberry Pi 2
CPU         Cortex-A8 (+ 2x PRU)    Cortex-A7 dual core  Cortex-A5 quad core    Cortex-A7 quad core
Clock       1 GHz                   900 MHz              1.5 GHz                900 MHz
RAM         512 MB DDR3             1 GB DDR3            1 GB DDR3              1 GB LPDDR2
Bus clock   800 MHz                 480 MHz              792 MHz                400 MHz
Ethernet    10/100 Mbit/s           10/100 Mbit/s        10/100/1000 Mbit/s     10/100 Mbit/s
Power       <460 mA @ 5 V (<2.3 W)  —                    <0.5 A @ 5 V (<2.5 W)  330 mA @ 5 V (1.65 W)
Altogether, the required memory M for two input segments and two impulse responses
of length Nt is given by
M = 64fs Nt
(38)
Therefore, using two channels, a sampling frequency of 48 kHz and one-second-long
impulse responses would require 64 × 48 kHz × 1 s = 3.072 MB of memory. As long
impulse responses, e.g. acoustic measurements from large spaces, are rarely more
than 10 s in practice and most modern SBCs come with at least 512 MB of RAM, it
is unlikely that the amount of memory would become an issue. A more consequential
attribute in applications that access memory often is the I/O bus clock speed. The
I/O bus connects the CPU to the RAM and the peripherals. The clock speed of the
bus determines the memory data access speed, potentially creating a performance
bottleneck.
3.1.3
Comparison
Table 4 lists the specifications of four currently popular SBCs that, at the time
of writing, cost less than 100 €. All four SBCs come with an ARM
Cortex-A series processor. The Cortex-A processor profile is designed specifically
for mobile devices that run complex functions and applications (the A stands for
application), e.g. fully featured operating systems [29]. The BeagleBone Black comes
with the high-performance Cortex-A8, Cubieboard2 and Raspberry Pi 2 use the
energy-efficient Cortex-A7 and the ODROID C1+ has the entry level Cortex-A5.
Table 5 lists a number of additional specifications of these three Cortex-A series
processors.
Table 4 shows that BeagleBone Black comes with 512 MB of RAM while all the
other SBCs come with 1 GB. As stated previously, the amount of RAM is more
than sufficient in all boards. The I/O bus clock speed favours BeagleBone Black
and ODROID C1+, which have about double the clock speed of Cubieboard2 and
Raspberry Pi 2. Unlike the other boards that use the general DDR3, Raspberry Pi 2
utilizes the low-power LPDDR2 memory.
Table 5: ARM Cortex-A series processor specifications. [29, p. 2-9, p. 8-11]

                            Cortex-A5       Cortex-A7        Cortex-A8
Clock speed                 1 GHz           1 GHz on 28 nm   1 GHz on 65 nm
Cores                       1–4             1–4              1
Peak integer throughput     1.6 DMIPS/MHz   1.9 DMIPS/MHz    2 DMIPS/MHz
L1 cache (data)             4 KB – 64 KB    8 KB – 64 KB     16/32 KB
L1 cache (inst)             4 KB – 64 KB    8 KB – 64 KB     16/32 KB
L2 cache                    –               128 KB – 1 MB    0 KB – 1 MB
NEON extension              Yes             Yes              Yes
Hardware divide             No              Yes              No
Fused multiply accumulate   No              Yes              No
The power consumptions in Table 4 are the typical bare-board power consumptions
stated in the product specifications or wiki pages of the SBCs. No official power
specifications were found for Cubieboard2. Raspberry Pi 2 shines here, consuming
only 1.65 W with a bare board, arguably due to the power-efficient Cortex-A7 processor
and the low-voltage LPDDR2 memory. However, the power consumption in practical
use depends highly on the connected USB peripherals and the total processing load.
All four SBCs recommend a power supply that supports 2 A current at
5 V, resulting in a maximum power output of 10 W. The maximum
consumption can be achieved by connecting several USB-powered devices to the
boards.
3.1.4
Raspberry Pi 2 Model B and HiFiBerry
This thesis uses the Raspberry Pi 2 Model B for the following reasons:
1. Being the cheapest of the four SBCs, Raspberry Pi 2 has a good cost-performance
ratio. Depending on how well the multiple cores can be utilized, the specifications appear solid overall. BeagleBone Black would be the preferred
choice in terms of raw performance, yet there are other factors that speak for
Raspberry Pi 2.
2. Currently, the Raspberry Pi Foundation is one of the most popular SBC manufacturers around. Hence, the open-source community around Raspberry Pi is
large, providing plenty of online forums, guides and tutorials, as well as
software and hardware projects showcasing the potential of the computer.
3. A quality DAC expansion board is available for Raspberry Pi 2. The HiFiBerry
DAC expansion boards by Modul 9 [35] ensure low latency audio output up to
192 kHz/24-bit stereo.
Figure 18: A simple Pure Data patch. [36]
The HiFiBerry DAC expansion boards are available in four versions: DAC+ Light,
DAC+ Standard RCA, DAC+ Standard Phone jack and DAC+ Pro. We use the
DAC+ Standard RCA because it provides the DAC with regular stereo RCA jacks.
The next subsection describes the Pure Data graphical programming environment
which is the platform for the implemented real-time convolution system.
3.2
Pure Data
Pure Data is a real-time graphical programming platform for audio and video
processing [36]. The graphical programming interface appeals especially to artists
and engineers with little traditional programming experience. Typical use cases of
Pure Data include live music performance, audio analysis, interfacing with sensors,
and controlling robots. Pure Data is open source software and available for various
operating systems including Windows, Linux and OS X.
In this thesis, we use Pure Data because it provides a modular environment for
building a complex DSP system using a combination of built-in and self-implemented
objects. Additionally, Pure Data is free and it is licensed in a way that developers,
engineers and artists are able to use it in any free or commercial application, provided
that the original copyright notices are retained.
3.2.1
Graphical programming interface
Traditionally, programmers write functions and data structures in a programming
language to a text file. Graphical programming uses visual objects and connectors to
represent functions, data structures and data flow. Pure Data implements graphical
programming by presenting a blank window called a patch where a programmer can
create objects and connections. Figure 18 illustrates a simple Pure Data patch.
Pure Data comes with a series of ready-made function objects that perform various
tasks from simple arithmetic to complex DSP operations. As Fig. 18 shows, objects
can also be user interface objects like sliders, buttons and tables. Objects generally
contain a number of inlets and outlets for the purpose of drawing connections to
other objects. Signal processing objects are distinguished with a special character
‘∼’ at the end of the object name. In general, the usage resembles early electronic
music where artists used physical patch cables to connect signal processing devices
with each other in real time in order to create music.
3.2.2
Programming external objects for Pure Data
In addition to the set of built-in objects, plenty of objects implemented by the
community are available on the Internet. These additional objects are often referred
to as externals to distinguish them from the built-in objects that come with the
Pure Data installation package. Developers can implement externals in the C language
by simply including the Pure Data header file and using the functions defined in
the header. A tutorial by Zmölnig [37] is a good starting point for learning PD
programming because it explains the basic structure of an external and describes
the usage of the core functions. Otherwise, a programmer can only study the source
code of external objects made by other developers.
Pure Data structures its objects in the same way as many other C- and C++-based real-time
audio APIs. A typical real-time audio processing program is a small specialized
sub-program inside a larger host program, for example, an effect plug-in inside a
digital audio workstation or, as in our case, a DSP object inside Pure Data. This type
of software is commonly implemented using classes and object oriented programming.
Class instances can be created and removed during the execution of the host program.
Furthermore, a user can easily include new sub-programs by inserting compiled
classes in a directory known by the host program. Unlike C++ that implements
object oriented programming by definition, C enables similar functionality through
structs and functions.
An algorithm class contains the data space required by the particular algorithm.
In other words, all the parameters and data containers required by the algorithm,
e.g. gain coefficients and delay line arrays, are declared as members of the algorithm
class. The data space should at least contain pointers to input and output frame
arrays, allowing the algorithm to perform as a bypass unit. Additionally, the class
should contain at least the following four methods.
Constructor declares an instance of the class. It initializes the algorithm parameters
and allocates memory for the data structures.
DSP method adds the class to the DSP tree by granting the class a pointer to the
audio I/O buffer. In Pure Data the DSP method is called each time DSP is
switched on. This method is also called when audio settings, e.g. sampling
rate or audio I/O buffer size, are changed.
Perform method carries out the actual signal processing tasks. The method
iterates over the samples in the audio buffer and DSP algorithms can be
implemented inside this process loop. This method is called each time a new
audio buffer is available. It should be noted that failing to process all the data
before the next audio buffer is available causes the execution of the perform
method to be interrupted and the method begins processing the new buffer.
Destructor frees the memory that was reserved for the algorithm class by the
constructor. The method is called when the DSP object is deleted.
Figure 19: Class diagram of a Pure Data external. Pure Data interfaces with the
external using the class methods/functions.
Fig. 19 describes a class diagram of a Pure Data external. The class is named
myAlgorithm. The postfix _tilde marks the class as a signal object in Pure Data.
The postfixes _new, _dsp, _perform and _free denote the functions described earlier,
respectively, and are required by Pure Data. Additionally, an external requires a
set-up function that attaches the external to the pool of objects in Pure Data.
If a programmer wants the external to receive messages in Pure Data, he/she has
to define functions to process each type of message. Message processing functions
are marked with a prefix that states the selector of the message. For example, if
the myAlgorithm external receives a message open /home/pi/foo.wav it requires
a function named myAlgorithm_tilde_open in order to receive the file path string.
Programmers can also freely define their own functions inside the external.
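To make the structure concrete, the following sketch shows a minimal pass-through
signal external built from the functions named above. It is a simplified illustration
of the general pattern under the standard m_pd.h API (the destructor is omitted
because nothing is allocated); it is not the source code of the externals implemented
in this thesis.

#include "m_pd.h"

static t_class *myAlgorithm_tilde_class;

typedef struct _myAlgorithm_tilde {
    t_object x_obj; /* mandatory Pure Data object header */
    t_float x_f;    /* dummy float for the main signal inlet */
} t_myAlgorithm_tilde;

/* Constructor: create an instance and its signal outlet. */
static void *myAlgorithm_tilde_new(void)
{
    t_myAlgorithm_tilde *x = (t_myAlgorithm_tilde *)pd_new(myAlgorithm_tilde_class);
    outlet_new(&x->x_obj, &s_signal);
    return (void *)x;
}

/* Perform method: process one audio buffer (here a simple bypass).
   w[1] is the object pointer (unused here), w[2]/w[3] the I/O vectors,
   w[4] the block size. */
static t_int *myAlgorithm_tilde_perform(t_int *w)
{
    t_sample *in = (t_sample *)(w[2]);
    t_sample *out = (t_sample *)(w[3]);
    int n = (int)(w[4]);
    while (n--)
        *out++ = *in++;
    return (w + 5);
}

/* DSP method: add the perform routine to the DSP tree. */
static void myAlgorithm_tilde_dsp(t_myAlgorithm_tilde *x, t_signal **sp)
{
    dsp_add(myAlgorithm_tilde_perform, 4, x, sp[0]->s_vec, sp[1]->s_vec, sp[0]->s_n);
}

/* Set-up function: register the class with Pure Data. */
void myAlgorithm_tilde_setup(void)
{
    myAlgorithm_tilde_class = class_new(gensym("myAlgorithm~"),
        (t_newmethod)myAlgorithm_tilde_new, 0,
        sizeof(t_myAlgorithm_tilde), CLASS_DEFAULT, 0);
    class_addmethod(myAlgorithm_tilde_class,
        (t_method)myAlgorithm_tilde_dsp, gensym("dsp"), A_CANT, 0);
    CLASS_MAINSIGNALIN(myAlgorithm_tilde_class, t_myAlgorithm_tilde, x_f);
}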
3.2.3
Compiling
Externals can be compiled with any C compiler, yet GCC (GNU Compiler Collection)
is recommended [38] as it has an extensive set of optimization options for the ARM
architecture. Moreover, GCC comes pre-installed with most Linux distributions.
The GCC optimization options are invoked by defining the respective flags in the
compiler options. Compiler and linker options are commonly written in scripts called
makefiles in Unix environments. The scripts are executed with the make utility.
Figure 20: The a) in-place and b) out-of-place forward transform performance of a
number of FFT-libraries as a function of transform size.
Pure Data externals are essentially dynamic libraries. Therefore, we have to
include the linker flags -export_dynamic and -shared in Linux or -bundle in OS X.
Other linker flags include additional libraries to be linked with our program, as
well as the output and input file names. The output file needs a special file
extension in order to be recognized by Pure Data. The extension is different among
operating systems, for example, in Linux the extension is .pd_linux and in OS X
it is .pd_darwin. The full makefile with all compiler and linker flags is found in
Appendix A.
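For illustration, a single self-contained build command under these conventions
could look as follows; the include path, optimization flags and library names are
assumptions for this sketch, and the complete makefile used in this work is the one
in Appendix A.

gcc -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -fPIC -shared \
    -I/usr/include/pd -o conv~.pd_linux conv.c -lfftw3f -lm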
In this section, we gave a brief overview of Pure Data and how to program
external objects for it. In the next section, we discuss the FFT library used by our
external.
3.3
FFT library
In order to execute fast convolution, our program needs to compute FFTs. Several
open-source FFT libraries can be found online. We use FFTW (Fastest Fourier
Transform in the West) version 3.4.3 by Frigo and Johnson [39] because it is easy
to use, well documented and supports a wide range of platforms including the
ARM processor architecture. The library also supports multi-threading and SIMD
execution, including ARM Neon since version 3.3.1, potentially improving the FFT
performance on the Raspberry Pi 2.
The performance of FFTW was compared with a number of other ARM-compatible
libraries by running a benchmark on the Raspberry Pi 2. The developers of FFTW
provide a benchmark software called benchFFT [40] to measure the performance
of the library and compare it against other open-source FFT-libraries. The source
code package includes several open-source FFT-libraries for the benchmark and
scripts for plotting the results. Fig. 20 shows the performance of the benchmarked
39
FFT-libraries in MFLOPS (Million Floating point Operations Per Second) as a
function of transform size in samples.
Figure 20a shows an in-place transform, i.e. the output is stored overwriting the
input array, whereas Fig. 20b shows an out-of-place transform where the input and
output are in separate arrays. The transform in both graphs is one-dimensional
and executed with single precision (32-bit float) data. The figures show clearly that
among these libraries, FFTW3 is able to perform the most theoretical MFLOPS on
Raspberry Pi 2 regardless of the FFT window size.
The FFTW library supports several computer architectures and it can be optimized for each one. When installing FFTW or benchFFT, it is important to give the
configure script as much information about the underlying architecture as possible.
We configured both FFTW and benchFFT using the appropriate flags for the ARM
Cortex-A7 architecture. The configure flags and the install procedure are described
more in detail in Appendix A.
Next, we briefly describe how FFTW is used in C applications. The following
code describes a simple C program that uses FFTW in order to compute a
one-dimensional forward DFT.
#include <fftw3.h>
...
{
    fftw_complex *in, *out;
    fftw_plan p;
    ...
    /* allocate the input and output arrays with FFTW's aligned allocator */
    in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    /* create a plan for a length-N forward transform */
    p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    ...
    fftw_execute(p); /* repeat as needed */
    ...
    fftw_destroy_plan(p);
    fftw_free(in); fftw_free(out);
}
FFTW does not use a fixed algorithm to compute the DFT; instead, it measures
the underlying system in order to find the best algorithm for the given
hardware [39]. Therefore, FFTW performs the transform in two phases. First, a
planner method is executed, which produces the algorithm as described earlier. The
planner requires a number of parameters, including the FFT size and the input and
output arrays. The method returns a data structure referred to as a plan. Second,
the plan is executed by calling an executing method. The method takes a plan as
a parameter, transforms the input array and stores the result in the output arrays
designated by the plan. The library also provides its own methods for freeing user
allocated plans and arrays.
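Since the audio data in this work is 32-bit floating point, which is also the precision
used in the benchmarks above, the single-precision interface of FFTW (functions
prefixed fftwf_ instead of fftw_) applies. The following sketch, with illustrative
names, shows a real-to-complex forward transform; in a real-time external the plan
would be created once in the constructor or DSP method and only executed in the
perform method.

#include <fftw3.h>

/* Transform N real samples into N/2+1 complex frequency bins. */
void forward_r2c(float *in, fftwf_complex *out, int N)
{
    fftwf_plan p = fftwf_plan_dft_r2c_1d(N, in, out, FFTW_ESTIMATE);
    fftwf_execute(p);
    fftwf_destroy_plan(p);
}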
3.4
Audio over IP
Transmitting audio to the Raspberry Pi 2 is not trivial because neither
the Raspberry Pi 2 nor the HiFiBerry DAC includes audio input connectors. However,
like most SBCs, the Raspberry Pi 2 comes with a high speed Ethernet interface.
Networking provides a high throughput and low latency data transfer between the
host and slave computers enabling multiple channels to be transmitted simultaneously
with few cables. Moreover, no DA and AD conversions are required in between.
We are also able to send parameter data in conjunction with each network packet
enabling control messages to be sent to the DSP program.
Pure Data externals for sending and receiving uncompressed audio over a network
are available online. We use the open-source externals netsend~ and netreceive~
by Remu [41], based on netsend~ by Olaf Matthes. However, we need to slightly
modify the netsend~ source code due to 32/64-bit compatibility issues and to fulfill
the system specifications. The audio over IP externals are further discussed in
Section 4.1.
Figure 21: The audio over IP external objects bcast∼ a) and bcreceive∼ b) defined
for 4 channels and UDP protocol.
4
Implementation
In this chapter, we discuss the implementation of the system. The thesis implements
the Pure Data external objects bcast~ and bcreceive~ for sending and receiving
audio over IP, as well as the Pure Data external object conv~ for real-time fast
convolution. The section discusses the implementation of these objects and the Pure
Data patches on the host and receiver computers. We also discuss the convolution
algorithm, as well as the C programming considerations for the ARM Cortex-A7
processor.
4.1
Audio over IP externals
The bcast~ external object is a modified version of the netsend~ external object
developed by Remu and Olaf Matthes. The bcast~ object is described in Fig. 21a. It
takes the number of channels and the protocol type as a parameter. A connection is
established by sending the IP address and the port number of the receiving computer
to the external using a connect message. The external outputs a stream of network
packets, each one consisting of a header tag and a number of audio sample buffers.
In some systems the size of the network packet is limited, for example in OS X
the UDP packet size is only allowed to be 1500 bytes long, in which case the audio
data is fragmented over multiple packets. However, the network layer automatically
reconstructs fragmented packets and the receiving Pure Data object bcreceive~
buffers the audio data, maintaining a constant audio stream.
Similar to bcast~, the bcreceive~ external object is a modified version of
netreceive~. Figure 21b shows an example of the usage of bcreceive~. The object
takes the port number, the number of channels and the protocol type as parameters
and outputs the received audio data from its outlets.
In order to reduce network traffic, the audio is transmitted using broadcasting.
Broadcasting enables the host to send all data to the broadcast IP address, instead
of sending individual packets to each SBC. This also means that each device receives
the data for all devices. Therefore, the netsend~ and netreceive~ externals must
be modified so that the receiving patches are able to distinguish the data that is
designated to them.
In practice, the logic that distinguishes the data for each device is implemented
by including an additional integer parameter for each audio and impulse response
channel and assigning each receiving Pure Data patch an integer that identifies the
device. The integer is passed to bcreceive~ in a message as described in Fig. 21b.
In each patch, the parameters in the broadcast packet are compared to the identifier
and the channels that have a matching parameter are extracted for processing.
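The following C sketch illustrates this filtering logic; the frame structure, block size
and function name are illustrative only and do not reproduce the actual bcreceive~
source.

#include <string.h>

#define BLOCKSIZE 64 /* Pure Data's default block size */

/* Illustrative layout of one channel inside a broadcast packet. */
typedef struct _channel_frame {
    int tag;                  /* device identifier written by the host patch */
    float samples[BLOCKSIZE]; /* one block of audio for this channel */
} t_channel_frame;

/* Copy to the outlets only the channels whose tag matches this device. */
static int extract_channels(const t_channel_frame *frames, int nchannels,
                            int my_id, float out[][BLOCKSIZE])
{
    int kept = 0;
    for (int c = 0; c < nchannels; c++) {
        if (frames[c].tag == my_id)
            memcpy(out[kept++], frames[c].samples, sizeof frames[c].samples);
    }
    return kept;
}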
4.2
The real-time convolution external
The thesis implements a real-time convolution external for Pure Data referred to as
conv~. The external takes 2k input signals and convolves each signal pair to produce
k output signals. Unlike traditional convolution algorithms that load an impulse
response from a file, conv~ loads the impulse responses as an audio input alongside
the actual input signal. During execution, long impulse responses are stored into
RAM as they are received. Additionally, the algorithm takes in two parameters that
notify the external of the length of the impulse response and when the user changes
it. A third parameter is also used to switch the convolution on or off, i.e. to bypass
the processing.
4.2.1
Storing the input blocks
The network objects transfer long impulse responses in parallel with the input audio channels. Therefore, the convolution external must gradually store the received impulse
response blocks hn into an array until the full impulse response is reconstructed. An
equal amount of input audio blocks xn must also be stored at each process iteration.
The correct convolution result requires each input audio block to be convolved with
each impulse response block as described in Section 2.4.1.
The external requires information about when to start and stop storing the input
blocks. As mentioned earlier, the external takes three parameters per source (a
channel pair that consists of an input channel and an impulse response channel). The
first parameter is essentially a flag with a value of either 0 or 1. A value opposite
to the one at the previous process iteration indicates that the user has selected
a new impulse response. Thus, the current impulse response block is the first block
of that response.
The second parameter is the length of the impulse response in number of samples.
The external converts the length into number of blocks rounding up to the next
integer. The external counts the number of blocks it stores. When the counter
reaches the length of the impulse response, the impulse response channel is simply
ignored until the first parameter changes value again. Input audio blocks must still
be stored for the amount of iterations it takes for the blocks to be convolved with all
the stored impulse response blocks.
Algorithm 1 describes the logic for storing input blocks xn and hn as a pseudo
code. The first if-statement checks whether the first parameter has changed and
performs the pointer reset and the conversion of the impulse response length as
described earlier. The second if-else-statement checks whether blocks hn must still be
stored. If the algorithm has stored enough impulse response blocks the else branch
is executed and only the block xn is stored. In the latter case, the storage arrays are
no longer incremented but the first input block in the array SX is overwritten by the
current block.

Algorithm 1 Logic for storing input blocks xn and hn into storage arrays SH and SX.
if IR has changed then
    end pointer ← 0
    block counter ← 0
    IR length ← ⌈length(h) / block length⌉
end if
if stored blocks < IR length then
    augment hn and xn with zeros
    Hn = FFT{hn}
    Xn = FFT{xn}
    append Hn to SH
    append Xn to SX
    increment end pointer
    increment stored blocks
else
    augment xn with zeros
    Xn = FFT{xn}
    write Xn to SX
end if

Figure 22: Circular buffer that contains three impulse response blocks. The pointer
pe indicates where the last block ends. The pointer prw is the read/write pointer
that accesses sample indices 0, 1, ..., pe.
Dynamic memory operations, such as malloc and free, are generally not recommended in real-time functions because predicting the time required for allocating
and deallocating memory is difficult [42]. Therefore, the constructor allocates space
for all the data structures required by the external, including two large arrays for
the storage of the input audio and impulse response blocks. The size of the arrays
can be very large, e.g. to contain at least 60 s of audio, as the Raspberry Pi 2 has
1 GB of RAM and no other object or data structure requires excessive amounts of
memory. The storage array for the impulse response blocks is illustrated in Fig. 22.
In most scenarios, the arrays have plenty of additional empty space at the end. The
final sample of the last block is indicated by the end pointer pe. The sample data is
accessed using a read/write pointer prw, which loops back to the start of the array
after passing pe. Therefore, the array operates similarly to a circular buffer.
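A minimal C sketch of this access pattern is given below; the variable names are
illustrative and, unlike the actual conv~ storage, the sketch handles plain real-valued
samples for clarity. Each write advances the read/write index and wraps it at the
end pointer, so the oldest stored block is overwritten once the buffer is full.

/* Write one block of B samples into the circular storage array S.
   p_rw is the current read/write index and p_e the end index of the
   stored data (both in samples); the function returns the new p_rw. */
static int store_block(float *S, const float *block, int B, int p_rw, int p_e)
{
    for (int i = 0; i < B; i++) {
        S[p_rw++] = block[i];
        if (p_rw >= p_e)  /* wrap around after passing the end pointer */
            p_rw = 0;
    }
    return p_rw;
}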
In order to implement efficient fast convolution, the input blocks are stored in
the frequency domain. Thus, the storage process also involves zero augmentation
of the blocks and the FFT. The zero-augmented blocks of complex numbers require
four times the memory of the original blocks of floating point numbers.
However, this method effectively decreases the number of FFTs required, reducing
the overall computational complexity of the external.
4.2.2
Convolution algorithm
When the input blocks have been received, transformed and stored, the external
convolves the input audio blocks and the impulse response blocks in order to produce
an output block. In Section 2.4.3 we discussed how the overlap-save method generally
requires less operations than the overlap-add method. However, in our case the
overlap-add method is used instead because it benefits more from storing frequency
domain sample blocks. Storing the input blocks in the frequency domain not only
reduces the number of FFTs but it also has the benefit that addition in the frequency
domain is equivalent to addition in the time domain. Therefore, no intermediate
FFTs or IFFTs are required. In order to produce an output block, overlap-add
method requires only one FFT (two if an impulse response block is stored) and one
inverse FFT. The overlap-save method on the other hand must inverse transform
all the sub-results as truncating or augmenting an array is only possible in the time
domain.
As discussed in Section 2.3.3, fast convolution requires 12N log2(2N) + 8N real
multiplications. If L is the length of the impulse response in number of blocks, the
number of multiplications for the general real-time overlap-save or overlap-add method
is given by
C = L(12N log2(2N) + 8N)
(39)
When the x and h blocks are stored in the frequency domain, the overlap-save method
only needs L inverse FFTs and the number of multiplications becomes
Cos = 4LN log2(2N) + 8N log2(2N) + 8LN
(40)
Using the overlap-add method, the number of transforms does not depend on
L at all. Therefore, Eq. (39) becomes
Coa = 12N log2(2N) + 8LN
(41)
Figure 23 shows the number of real multiplications required by four alternative
convolution methods. The methods are time domain convolution, fast convolution,
overlap-save fast convolution with blocks stored in the frequency domain and lastly
overlap-add fast convolution with blocks stored in the frequency domain. The data
was generated using L = 100 blocks. The graph further demonstrates how time
domain convolution is infeasible at large block sizes but also justifies the use of the
overlap-add method in the case when the input blocks are stored in the frequency
domain.

Figure 23: The number of multiplications as a function of audio block size. Four
real-time convolution algorithms are compared: 1. overlap-save time domain convolution,
2. overlap-save fast convolution, 3. overlap-save and 4. overlap-add fast convolution
with blocks stored in the frequency domain.

Figure 24: The overlap-add convolution algorithm.
Lastly, Fig. 24 visualizes the overlap-add convolution algorithm used by the conv~
external. The block diagram shows an example where an input signal is convolved
with a length-3B impulse response, where B is the block size. Most of the arithmetic
is performed on frequency domain data. Only the overlapping segment is added in
the time domain. The resulting output sequence y2[n] is divided into two segments,
the output block y2[B], y2[B + 1], ..., y2[2B] and the overlapping segment a2[n].
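A C sketch of the corresponding accumulation loop is shown below. It assumes that
the stored spectra are interleaved real/imaginary pairs and that the blocks in SX
have already been arranged so that the l-th input spectrum is the one to be
multiplied with the l-th impulse response spectrum; the variable and function names
are illustrative and do not reproduce the conv~ source.

/* Accumulate Y[k] = sum over l of X_l[k] * H_l[k] for M complex bins.
   SX and SH each hold L stored spectra as arrays of [re, im] pairs. */
static void accumulate_spectrum(const float (*SX)[2], const float (*SH)[2],
                                float (*Y)[2], int L, int M)
{
    for (int k = 0; k < M; k++) {       /* clear the accumulator */
        Y[k][0] = 0.0f;
        Y[k][1] = 0.0f;
    }
    for (int l = 0; l < L; l++) {       /* one stored block pair at a time */
        const float (*X)[2] = SX + l * M;
        const float (*H)[2] = SH + l * M;
        for (int k = 0; k < M; k++) {   /* complex multiply-accumulate */
            Y[k][0] += X[k][0] * H[k][0] - X[k][1] * H[k][1];
            Y[k][1] += X[k][0] * H[k][1] + X[k][1] * H[k][0];
        }
    }
    /* A single inverse FFT of Y then yields the time-domain output block,
       to which the overlapping segment from the previous iteration is added. */
}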
4.3
Host Pure Data patch
We construct the DSP system using Pure Data, as discussed in Section 3.2. We
use internal objects of Pure Data for interfacing with the hardware audio devices,
third-party externals for sending audio over the LAN and a self-implemented real-time
convolution external for the actual processing. Additionally, we can use the Pure
Data patch as a graphical user interface (GUI) on the host computer.
We keep the description of the patches brief, as fully understanding
how the patches function requires prior knowledge of programming and patching
in Pure Data. We recommend exploring the Pure Data manual [36] for further
information about patching in Pure Data.
The host patch is visualized in Fig. 25. The sub-patch in Fig. 25a opens and
plays back audio files using the getInput~ and getIR~ sub-patches. The sub-patch
generates the parameter array by collecting the impulse response length and the
change flag from getIR~ and the device number and the bypass flag from the message
and toggle objects. The pack object assembles the parameters into an array which
is sent to the first inlet of the bcast~ object. The bang objects at the top invoke a
dialogue when clicked, enabling the user to open an audio file using a browser window.
The array1 object is a wave table that is required by the soundfiler object in the
getIR~ sub-patches in order to obtain the length of the impulse responses.
The sub-patches getInput~ and getIR~ are shown in Fig. 26a and Fig. 26b. The
getInput~ patch simply opens an audio file and reads it to the signal outlet. The
file path and the read settings are specified by either the message received in the left
inlet, or by the user input in a file browser invoked by the openpanel object. The
getIR~ sub-patch is slightly more complicated as it must also provide the length
of the response and toggle a flag every time a user changes the impulse response
file. The signal outlet in the left transfers the impulse response, the middle outlet
sends the impulse response length and the rightmost outlet transfers the boolean
parameter that indicates a new user input.
The main host patch, visualized in Fig. 25b, acts as a user interface with messages
that correspond to the various receive objects in the sub-patch in Fig. 25a. This
way, the sub-patch hides the complex details of the system and users can simply write
file paths in the message fields and click them to open audio files. The connect and
disconnect messages are necessary for opening and closing the network connection.
The dsp on and off messages are provided for convenience.
4.4
The receiving Pure Data patch
Figure 27 describes the Pure Data sub-patch that runs in the receiving SBC. The receive dev object sends a message to the bcreceive~ object, which initializes the identifier of the patch. This identifier is used to extract the correct input audio buffers from the network packet, as discussed in Section 4.1. The conv~ external receives the input signals directly from bcreceive~. In this particular patch, bcreceive~ defines 4 output channels and conv~ defines 4 input and 2 output channels. The fifth outlet of bcreceive~ is used to pass the parameter array to conv~. The last outlet produces printable debugging information when bcreceive~ is banged.

Figure 25: The Pure Data patch at the host computer. The patch consists of a sub-patch a) and the main patch b).

Figure 26: The sub-patches for a) reading input audio and b) impulse response files.

Figure 27: The Pure Data patch at the receiving single board computer. The outlets of bcreceive~ from left to right are x1(n), x2(n), h1(n), h2(n), the parameter array and a string that prints out information.
The main patch simply contains a dac~ object that interfaces with the sound
card. The two outlets of the sub-patch connect to the inlets of the dac~ object.
The convolution must be performed in a sub-patch because the dac~ object only
supports the default block size of 64. Even though the sub-patch sends longer output
buffers, the main patch only reads the outlets of the sub-patch 64 samples at a time, segmenting the long buffer into many 64-sample buffers.
As the Raspberry Pi 2 runs optimally “headless”, i.e. without any peripherals attached and without a GUI, the Pure Data patch must be run from the command line using a Secure Shell (SSH) client. Unfortunately, Pure Data lacks an interactive command line interface, and messages can only be sent to the patch as parameter flags at start-up. Therefore, changing the parameters of the receiving patch, such as the number of channels and the port number, is not possible while Pure Data is running. To incorporate more or fewer channels, Pure Data must be restarted by
running another patch with the proper parameters. The command for running the
receiving Pure Data patch from the command line is listed in Appendix A.
4.5
Code optimization for ARM processors
When implementing real-time DSP systems, code optimization is essential for achieving the best possible performance on the given hardware. With careful system design, low-cost hardware can deliver the same performance as a poorly programmed system on expensive hardware. Additionally, modern processors include special features for certain applications, such as dedicated DSP instructions, SIMD registers and power management instructions, that the programmer can utilize to increase code performance and functionality.
The ARM Cortex-A series programmer's guide [29] is a good reference for code optimization details for the Cortex-A7 processor. We now describe the optimization methods that are the most relevant for a real-time DSP application.
4.5.1
Enabling the Advanced SIMD extension
Modern ARM processors include an Advanced SIMD extension (NEON™) which can be utilized to accelerate vector arithmetic, as described in Section 3.1.1. NEON™ is included by default in the Cortex-A7 processor. Despite being integrated into the ARM processor, NEON™ contains independent execution hardware, a separate execution pipeline and a register bank distinct from the ARM core register bank. It provides a 128-bit SIMD instruction set supporting 8-, 16- and 32-bit integer and single precision floating point data (integer data up to 64 bits). Thus, NEON™ is theoretically able to accelerate vector operations on 32-bit data by a factor of four.
In order to enable the NEON™ extension, programmers have to notify the compiler of the following things. First, the pointers to the arrays that will be processed must not alias, i.e. two pointers cannot point to the same memory location. This prerequisite is required by many optimization options, including NEON™. In the C language, pointers are assumed to alias by default; however, programmers can inform the compiler that a pointer does not share its memory location with any other pointer by declaring the pointer with the keyword __restrict.
Second, the compiler must parallelize the source code in a way that enables efficient usage of the NEON™ extension. Thus, programmers must notify the compiler to vectorize the source code with the appropriate compiler flag. In GCC, vectorizing is enabled by including the flag -ftree-vectorize in the compile command.
Last but not least, the NEON™ instruction set is invoked by specifying the GCC compiler flag -mfpu=neon. In order to use the hardware floating point registers, the compiler flag -mfloat-abi=hard is also necessary.
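As an illustration, the following loop is a candidate for such auto-vectorization; the function and its arguments are hypothetical and only demonstrate the use of the __restrict keyword together with the flags mentioned above (e.g. gcc -O2 -ftree-vectorize -mfpu=neon -mfloat-abi=hard).

/* The __restrict qualifiers promise the compiler that the buffers never
 * overlap, allowing GCC to vectorize the loop for the NEON unit. */
void scale_add(float *__restrict out, const float *__restrict a,
               const float *__restrict b, float gain, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + gain * b[i];
}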
4.5.2
Cache optimization
In modern computers, the computational bottlenecks are typically caused by cache
misses. The time required to read and write data increases the further in the memory
hierarchy the data is located. Typical processors contain two levels of caches referred
to as L1 and L2 cache that act as small but fast intermediate data storages between the
CPU registers and the RAM. When a program runs, the CPU loads the instructions
and the data used by the program into the caches. However, programs with large datasets rarely fit into the caches completely, and a proportion of the program is stored in RAM. If the data is not found in the L1 cache, a cache miss occurs and the
CPU tries to fetch the data from the L2 cache. Similarly, if the data is not found
in the L2 cache, another cache miss occurs and the data must be fetched from the
RAM. Consequently, the processor uses a significant amount of cycles searching for
the data.
Cache misses occur naturally and are often unavoidable. However, in some cases careful programming can prevent unnecessary cache misses. Arranging the members of data structures according to how frequently they are used in the code can prevent cache misses. First, it is beneficial to declare the most frequently used members first, as they then fit into the first cache line starting from the beginning of the data structure. Second, it is advantageous to split data structures if some members are accessed much more frequently than others. In order to access a member of a struct, the CPU must load the struct itself into the cache. Therefore, members that are used seldom are fetched for no purpose and take unnecessary space in the cache. Lastly, members that are used in the same region of code should be arranged sequentially in data structures. This way, the members are likely to reside in the same cache line and no additional data loads are necessary.
4.5.3
Data alignment
Furthermore, data alignment and the type of variables can affect cache efficiency.
Basic C datatypes are not arbitrarily laid out in memory; instead, each datatype is self-aligned to start from a boundary according to the size of the datatype. For example, in 32-bit architectures, an integer starts from a word boundary at every 4 bytes, a short starts at each even byte and a long starts at a boundary of 8
bytes. The alignment restrictions prevent datatypes from spanning over multiple
words which would require additional instructions to load a single variable. A char
is a special case, as it can be located anywhere: accessing a byte takes the same number of instructions regardless of its location within a word.

Figure 28: A struct that places the 4-byte datatype in the middle occupies 12 bytes of memory a), whereas the struct that places the 4-byte member last occupies only 8 bytes b).
In order to maintain data alignment within structs, C compilers insert padding after member variables if the next variable does not fit at the byte boundary where the current member ends. Often, struct padding results in a larger memory footprint than is necessary. Furthermore, it decreases cache efficiency, as unnecessary empty bytes are loaded into the cache.
A programmer can prevent unnecessary struct padding by ordering the members of structs so that a minimal amount of padding is needed. The reordering of struct members is illustrated in Fig. 28. Declaring the 2-byte variable before the 4-byte variable decreases the size of the struct by 4 bytes. It should be noted that the compiler always pads structs to a word boundary, hence the additional 2 bytes at the end of the struct in Fig. 28a.
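The following small C program illustrates the reordering of Fig. 28 on a 32-bit target with natural alignment; the member names are arbitrary.

#include <stdio.h>

typedef struct {        /* Fig. 28a: the 4-byte member in the middle        */
    char  flag;         /* offset 0, followed by 3 padding bytes            */
    int   count;        /* offset 4                                         */
    short id;           /* offset 8, followed by 2 trailing padding bytes   */
} padded_t;             /* sizeof == 12                                     */

typedef struct {        /* Fig. 28b: the 2-byte member before the 4-byte    */
    char  flag;         /* offset 0, followed by 1 padding byte             */
    short id;           /* offset 2                                         */
    int   count;        /* offset 4                                         */
} packed_t;             /* sizeof == 8                                      */

int main(void)
{
    printf("%zu %zu\n", sizeof(padded_t), sizeof(packed_t));   /* prints 12 8 */
    return 0;
}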
4.5.4
Loop termination
According to the ARM Cortex-A7 programmer's manual, loop counters that count down to zero are faster than counters that count up from zero. This is because the ADD and SUB assembly language instructions provide a comparison with zero for free, whereas a comparison with a non-zero value requires an explicit CMP instruction.
The manual also recommends using 32-bit integers for loop counters because
the ADD and SUB instructions operate natively on two 32-bit registers. If smaller
quantities, e.g. two 16-bit variables are added or subtracted, the compiler might add
additional instructions to handle the overflow.
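A minimal example of such a count-down loop is given below; the function is illustrative only.

/* The 32-bit counter is decremented towards zero, so the loop condition is
 * a comparison against zero and no separate CMP instruction is needed. */
float sum_squares(const float *x, unsigned int n)
{
    float acc = 0.0f;
    for (unsigned int i = n; i != 0; i--)
        acc += x[i - 1] * x[i - 1];
    return acc;
}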
4.6
Multithreading
As the ARM Cortex-A7 is a quad core processor, the full potential of the CPU
can only be utilized by programs that distribute their tasks to multiple threads.
Additionally, convolution of multiple long impulse responses is theoretically an easily
parallelizable task. Large datasets can be divided into smaller subsets which can be
processed by a number of threads in parallel.
However, multithreading typically requires additional mechanisms for ensuring thread synchronization. Synchronization is needed in order to prevent race conditions, deadlocks and starvation during software execution. Unfortunately, the mechanisms that prevent these issues introduce computational overhead and are generally not recommended in real-time applications. One way to avoid these mechanisms is to initialize separate data structures for each thread and make sure that the code never accesses data that overlaps with another thread.
Unfortunately Pure Data does not support multithreading as such, i.e. multiple
similar signal objects are not processed in parallel. However, Pure Data objects
can be programmed to internally utilize threading libraries, such as Open Multi
Processing (OpenMP) or POSIX Threads (Pthreads).
We implement a parallel thread in the conv~ external object using the Pthreads library. As Pure Data objects are essentially libraries in themselves, it is difficult to decompose the processing into threads during the initialization of an object. However, the perform method can be easily parallelized by simply calling pthread_create() and pthread_join() in the conv_tilde_perform function. Despite being a quick and robust solution, creating and destroying the child thread during each iteration of the perform method introduces additional computational overhead. A more
sophisticated threading mechanism, such as using condition variables or a thread
pool, would potentially increase the algorithm performance. Due to time constraints,
this is left for future development.
In order to optimally utilize threads, the programmer must make sure that the
threads process an equal amount of data. Otherwise, one thread spends time waiting for the other to finish, and output buffer underflows might occur.
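The structure of this per-block threading can be sketched as follows. The sketch is a simplified stand-in for the actual conv~ code: process_channel() here is only a placeholder for the per-channel convolution, and the names are illustrative.

#include <pthread.h>

typedef struct {
    const float *in;    /* input block of this channel                      */
    float       *out;   /* output block of this channel                     */
    float        gain;  /* placeholder for the per-channel processing state */
    int          n;     /* block size                                       */
} chan_job_t;

/* Placeholder for the per-channel convolution work. */
static void process_channel(chan_job_t *job)
{
    for (int i = 0; i < job->n; i++)
        job->out[i] = job->gain * job->in[i];
}

static void *worker(void *arg)
{
    process_channel((chan_job_t *)arg);
    return NULL;
}

/* Called once per DSP block: spawn a worker for the second channel,
 * process the first channel in the calling thread, then join. The jobs own
 * disjoint buffers, so no locks are needed, but the thread is created and
 * destroyed on every block, which is the overhead discussed above. */
void perform_two_channels(chan_job_t *ch0, chan_job_t *ch1)
{
    pthread_t tid;
    pthread_create(&tid, NULL, worker, ch1);
    process_channel(ch0);
    pthread_join(tid, NULL);
}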
4.7
Raspbian configuration
The standard Raspbian distribution is a general purpose operating system which is
not optimized for running real-time tasks. However, a number of system settings
can be configured, such as enabling real-time scheduling for Pure Data, disabling
CPU frequency scaling and overclocking the CPU for additional clock speed. A
real-time patch for the Raspbian kernel is also available; however, applying the patch would have required compiling a new kernel. Due to time constraints, the real-time kernel was left for future development.
The system configuration steps are described in Appendix B. The next section
discusses the measurement methods used in evaluating the implemented system.
5
Audio distortion measurements
In order to evaluate the implemented system, we need to carry out a number of
measurements. Audio devices are generally characterized by their linear impulse
response, as well as noise and distortion attributes. In this thesis, the two common
distortion attributes, total harmonic distortion (THD) and intermodulation distortion (IMD), are used. In addition, real-time DSP systems are typically evaluated
according to latency and the computational complexity of the DSP algorithms. In
this chapter, we describe the above mentioned attributes and discuss the methods
used for measuring them.
5.1
Audio distortion
Before discussing distortion measurement techniques in more detail, we need to
define some distortion concepts. Distortion in the context of signal processing occurs
whenever the transfer function of a system alters the waveform of the input signal,
excluding noise, interference, and amplification or attenuation [43]. Distortion divides
into several subcategories.
5.1.1
Linear and non-linear distortion
Distortion can be divided into two categories, linear and non-linear distortion. Linear distortion occurs when a system changes the frequency balance of the input
signal according to its transfer function. Linear distortion does not introduce new
frequency components to the signal, but alters the magnitude or phase of the existing
components.
Unlike linear distortion, non-linear distortion changes the frequency content of the
input signal such that the energy of a frequency component in the input is distributed
into multiple frequencies in the output. A typical example of non-linear distortion is
clipping, i.e. the amplitude of the signal exceeds the available dynamic range. The amplitude values saturate to the maximum and minimum possible values, reshaping the waveform and introducing additional components in the frequency domain.
5.1.2
Harmonic distortion
Harmonic distortion produces frequency components to the output spectrum at
integer multiples of the input frequencies [44]. Harmonic distortion is most noticeable
when a system is excited with a single tone sinusoidal input signal. The harmonics
produced by the system can then be observed from the output spectrum. Harmonic
distortion is simply specified as the ratio of the magnitude of the harmonic of interest
and the magnitude of the fundamental. Harmonic distortion can be expressed as a
percentage or in decibels. Figure 29 illustrates harmonic distortion in the frequency
domain for a single tone sinusoidal signal.
Figure 29: Illustration of harmonic distortion.

Total harmonic distortion (THD) is the ratio of the root-sum-square value of all the harmonic distortion components to the root-sum-square value of the harmonics and the fundamental. However, typically only the first five distortion orders are significant, as higher-order components are much lower in magnitude and quickly fall below the noise level. Given a sinusoidal excitation signal with an amplitude A0, the THD is obtained by
THD = \frac{\sqrt{A_1^2 + A_2^2 + \ldots + A_K^2}}{\sqrt{A_0^2 + A_1^2 + A_2^2 + \ldots + A_K^2}}    (42)
where K is the number of harmonics and A denotes the amplitudes of the harmonic
components. Commonly THD is expressed either in decibels or as a percentage.
THD in decibels is obtained by
THD_{dB} = 20 \log_{10}(THD)    (43)
and the percentage value is obtained by simply multiplying the result of Eq. (42) by
100.
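For illustration, Eqs. (42) and (43) translate into a small helper function; the amplitudes are assumed to be measured beforehand, and the function name is arbitrary.

#include <math.h>

/* a0 is the amplitude of the fundamental, a[0..k-1] the amplitudes of the
 * harmonic distortion components; returns the THD in decibels, Eq. (43). */
double thd_db(double a0, const double *a, int k)
{
    double harm = 0.0;
    for (int i = 0; i < k; i++)
        harm += a[i] * a[i];
    double thd = sqrt(harm) / sqrt(a0 * a0 + harm);   /* Eq. (42) */
    return 20.0 * log10(thd);
}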
5.1.3
Intermodulation distortion
Besides examining the harmonic distortion produced by single tone input signals, it
is often required to examine the distortion produced by two or more tones played
simultaneously. Intermodulation distortion (IMD) results when the frequency components of a wide band signal interact and produce distortion components not found in
the original signal [43]. In practice, the circuit non-linearities cause amplitude and/or
frequency modulation of the higher frequency components by the lower frequency
components.
Figure 30 visualizes IMD for a two-tone input signal with frequencies f1 and f2. The distortion products appear at the upper and lower sidebands of the higher frequency tone. The frequencies of the distortion components are equal to the sums and differences of the upper frequency f2 and integer multiples of the lower frequency, e.g. f2 + f1, f2 − f1, f2 + 2f1 and f2 − 2f1.

Figure 30: Illustration of intermodulation distortion.
IMD for a signal consisting of two sinusoidal tones is calculated similarly to THD: the root-sum-square of the magnitudes of the distortion products is divided by the root-sum-square of the magnitudes of the distortion products and the modulated tone.
The distortion products from either the positive or the negative sideband can be used.
The higher test tone should be far enough in frequency from the DC-component and
the first tone, so that they do not affect the IMD components. Moreover, care has to
be taken not to measure IMD at harmonic multiples of the test tones. Otherwise the
harmonic distortion components might add up with the IMD components.
5.2
Logarithmic sine sweep method
Impulse response measurements are fundamental tools for characterizing audio devices
and acoustic environments. As discussed in Section 2.3, a linear and time-invariant
system is characterized by its impulse response h(t). The impulse response is generally
obtained by exciting the system with a known signal x(t) and measuring the output
y(t).
The choice of the excitation signal affects the quality of the measurement [45].
Reliable measurements require a perfectly repeatable input signal. Additionally, in
order to obtain a comprehensive frequency response, the input signal should contain
frequencies over the full audible spectrum (20 – 20 000 Hz).
Various excitation signals have been proposed in the literature, including unit
impulses, maximum length sequences, and stepped sine waves. Since the early 2000s,
the logarithmic sine sweep technique proposed by Farina [46] has become the industry
standard in acoustic and audio distortion measurements.
Sine sweeps in general have the advantage of producing measurements with high
signal to noise ratios (SNRs) because all the signal energy at any point in time is
concentrated on one frequency. Noise can be further suppressed by averaging over
multiple measurements. However, averaging requires a strictly time invariant system,
i.e. if some of the recordings are delayed the average will be shifted. A preferred
method is to use a single long sine sweep. Using longer sweeps results in higher
SNRs, as energy from a longer period of time is packed into an impulse response.
The logarithmic sine sweep method uses a sine wave excitation signal that is
frequency modulated by an exponential function in time. Thus, the instantaneous
frequency sweeps more slowly at low frequencies and faster at high frequencies, resulting in
a pink frequency spectrum (magnitude decreases -3 dB per octave). The motivation
for the exponentially increasing frequency is that it enables the separation of the nonlinear distortion components and the linear impulse response. Due to this property,
the method can be used for characterizing both the harmonic distortion responses
and the linear response of a DAC.
5.2.1
Generating logarithmic sine sweeps
In order to properly design the excitation signal, and to retrieve the harmonic order
responses, we need to discuss how to mathematically derive the logarithmic sweep
equation and the starting times of each harmonic distortion order. As derived by
Farina [46], the logarithmic sine sweep stimulus has the form
t
ω2
ln
 ω1 T 
ω1
 T
x(t) = sin 
 ω2 e
ln ω1





− 1

(44)
where T is the total length of the sweep, and ω1 and ω2 are the lower and upper
limits of the sweep's normalized frequency range. We can now use Eq. (44) to find out
the time delay ∆t at which the function has an instantaneous frequency of N times
the actual one. This represents the time delay between the N th order distortion and
the linear response. The following equation expresses the relationship between the
time delay and the harmonic order [46].
N \frac{d}{dt}\left[ \frac{\omega_1 T}{\ln(\omega_2 / \omega_1)} \left( e^{\frac{t}{T} \ln(\omega_2 / \omega_1)} - 1 \right) \right] = \frac{d}{dt}\left[ \frac{\omega_1 T}{\ln(\omega_2 / \omega_1)} \left( e^{\frac{t + \Delta t}{T} \ln(\omega_2 / \omega_1)} - 1 \right) \right]    (45)
From Eq. (45) we obtain
\Delta t = T \frac{\ln(N)}{\ln(\omega_2 / \omega_1)}    (46)
The value of ∆t is constant, which ensures that each harmonic distortion response is packed at a precise time instant before the linear response. Additionally, as ∆t is proportional to the logarithm of N, the time differences between the harmonic distortion responses vary: higher orders are spaced more closely than lower orders, as they are located further to the left of the linear response.
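A straightforward C implementation of Eq. (44) is sketched below; the fade-in and fade-out discussed in the next section are assumed to be applied separately, and the function name is arbitrary.

#include <math.h>

/* Generate a logarithmic sine sweep of length T seconds from f1 to f2 Hz
 * at sampling rate fs, following Eq. (44). */
void log_sweep(float *x, int fs, double T, double f1, double f2)
{
    const double w1 = 2.0 * M_PI * f1;
    const double w2 = 2.0 * M_PI * f2;
    const double K  = w1 * T / log(w2 / w1);     /* sweep constant    */
    const int    N  = (int)(T * fs);             /* number of samples */

    for (int n = 0; n < N; n++) {
        double t = (double)n / fs;
        x[n] = (float)sin(K * (exp(t / T * log(w2 / w1)) - 1.0));
    }
}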
5.2.2
Pre-processing
Additionally, we apply a short fade-in and fade-out to the beginning and the end
of the sweep to reduce pre-ringing artefacts in the measured impulse response [47].
Pre-ringing artefacts are oscillations that appear in the time domain impulse response
before the actual impulse, similar to a sinc function. This is due to the fact that
trapezoidal signal shapes in the frequency domain transform to sinc functions in the
time domain. Setting the upper frequency of the sweep to the Nyquist limit removes
the pre-ringing effect caused by the upper edge of the frequency window. However,
a fade-out is still required to make sure that the signal ends at a zero crossing. A
non-zero final value introduces a step discontinuity into the signal, which excites frequencies
across the entire spectrum. Setting the starting frequency to 0 Hz is not possible,
as can be seen from Eq. (44). We apply a long fade-in instead to minimize the low
frequency pre-ringing.
5.2.3
Post-processing
Playing back the logarithmic sweep stimulus x(t) yields the recorded output signal
y(t). In order to extract the transfer function h(t) of the system under test, y(t)
needs to be post-processed. To obtain h(t), we will need to convolve the measured
sweep with a reference sweep f (t) defined in such a way that
h(t) = y(t) ∗ f(t)    (47)
The reference sweep f (t) is a time reversed version of the stimulus signal. The
frequency spectrum of f (t) must also compensate for the pink spectrum of the
logarithmic sweep. We apply a compensation by amplitude modulating f (t) with
an inverse exponential function. Now both the frequency and the amplitude of f (t)
decrease as an inverse exponential function resulting in a +3 dB increase per octave
in the frequency spectrum. Figure 31 visualizes the magnitude responses of x(t) and
f (t).
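The reference sweep can be sketched in C as follows, assuming the stimulus x has already been generated as above; the names are illustrative.

#include <math.h>

/* Build the reference sweep of Eq. (47): time-reverse the stimulus and
 * weight it with a decaying exponential so that the magnitude spectrum of
 * the result rises by +3 dB per octave. */
void make_reference_sweep(const float *x, float *f, int n, double f1, double f2)
{
    const double L = log(f2 / f1);
    for (int i = 0; i < n; i++) {
        double gain = exp(-L * (double)i / (n - 1));   /* inverse exponential */
        f[i] = (float)(gain * x[n - 1 - i]);           /* time reversal       */
    }
}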
The resulting transfer function is then windowed in order to extract the linear
response and the harmonic distortion responses. To avoid discontinuities at the
window edges, a half-cosine is applied to the start and end of each window. The
length of the half-cosine is set to 5% of the window length. The window edges are
specified according to Eq. (46) for each harmonic distortion response.
d_1 = T \frac{\ln(N - r)}{\ln(\omega_2 / \omega_1)}    (48)

d_2 = T \frac{\ln(N + r)}{\ln(\omega_2 / \omega_1)}    (49)
where the quantity r is an adjustable parameter that determines the distance from
the harmonic response transient. We found that the value r = 0.002 is suitable to
cover the segment of the distortion responses above the noise floor. We are also able to maintain the SNR at a reasonable level because both the amplitude of the distortion responses and the window length decrease in proportion to the distortion order.

Figure 31: The magnitude response of the stimulus sweep and the reference sweep.
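The half-cosine edges can be implemented, for example, as in the following sketch; the window is assumed to be initialized to ones over the extracted segment, and the 5 % edge length follows the choice described above.

#include <math.h>

/* Apply half-cosine fades to the first and last 5 % of a window of 'len'
 * samples; the remaining samples are left untouched. */
void apply_half_cosine_edges(float *w, int len)
{
    int edge = (int)(0.05 * len);
    for (int n = 0; n < edge; n++) {
        float g = 0.5f * (1.0f - cosf((float)M_PI * n / edge));  /* 0 -> 1 */
        w[n]           *= g;                                     /* fade-in  */
        w[len - 1 - n] *= g;                                     /* fade-out */
    }
}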
Figure 32: Measurement set-up.
6
Results
In this section, we analyze the results of the audio distortion, latency and algorithm
performance measurements. The audio distortion and latency measurements were
processed and analyzed in Matlab. Before discussing the results, we first introduce
the measurement set-up.
6.1
Measurement set-up
We assemble a simple loop-back system in order to measure the linear and the
distortion responses of the HiFiBerry DAC+, as well as the overall latency of the
system. The set-up consists of a host laptop, a professional audio interface and a
single board computer. We use a MacBook Pro with OS X 10.9.2 as the host laptop
and MOTU UltraLite mk3 as the audio interface. The host laptop is connected to an
external audio interface because an ADC is required to capture the analogue audio
from the SBC. Moreover, the DAC of the audio interface provides a baseline to
which we can compare the DAC of the SBC. The measurement set-up is illustrated
in Fig. 32.
To measure the HiFiBerry DAC+, a stimulus signal x[n] is sent to the Raspberry
Pi through the network. The signal bypasses all processing, and only goes through
DACrp . The analogue signal grp (t) is then transmitted to the ADC of the MOTU
audio interface which converts the signal back to digital. The host laptop then
receives the digital signal yrp [n]. The laptop and the audio interface are connected
via Firewire.
In order to obtain a reference measurement, the MOTU audio interface is measured
similarly, but the signal is simply looped from the analogue output back to the
analogue input of the interface. The DAC of the MOTU is referred to as DACmt in
Fig. 32.
We use a Pure Data patch to conduct the measurements on the host laptop. The
patch simultaneously reads a test signal from a file and sends it to the network,
as well as receives and writes the input signals to a file. This way latency can be
measured by simply comparing the phases of the original and recorded sound files.
6.2
Linear responses of the DACs
We measure the linear frequency responses of the HiFiBerry and the MOTU audio
interfaces using the logarithmic sine sweep method. The logarithmic sine sweep is
generated in Matlab according to Eq. (44). The starting frequency of the sweep is set to 5 Hz and the upper frequency limit to the Nyquist frequency of 22 050 Hz (we use a 44.1 kHz sampling rate) in order to cover the spectrum as widely as possible.
We used a 40 s long sweep in order to achieve high enough SNR to clearly present the
harmonic distortion responses. Furthermore, a sample resolution of 24 bits was used.
The 24-bit resolution enables us to capture distortion components as low as -144 dB.
With 16 bits, the dynamic range would only reach -96 dB which is insufficient when
measuring low level distortion. The overall amplitude of the logarithmic sweep was
normalized to full scale (0 dBFS).
The HiFiBerry DAC+ uses a Texas Instruments PCM5121 chip which has four optional digital interpolation filters after an oversampling stage [48]. Oversampling
is used in digital-to-analogue conversion in order to alleviate the specifications of the
anti-aliasing filter. In order to preserve the original signal spectrum after oversampling,
additional zeros are first inserted between the original sample values. A low-pass
filter then interpolates the zeros between the original sample values reconstructing
the original waveform.
Figure 33 shows the linear responses of the DAC of the MOTU audio interface, and of the HiFiBerry DAC when the default FIR and the optional low-latency IIR interpolation filters are used. As expected, the professional MOTU DAC has the flattest frequency response with a steep transition band close to the Nyquist frequency. The interpolation filters of the HiFiBerry DAC have a wider transition band than the MOTU. However, the difference is only on the order of tenths of a decibel. The MOTU and the HiFiBerry FIR have a low ripple at the high frequencies, whereas the HiFiBerry IIR filter has no ripple.
In all test cases the signal must be converted back to digital in the MOTU audio
interface. Therefore, it should be noted that the MOTU ADC might also have an effect on the system response. The effect of the ADC transfer function can be
eliminated if we use the transfer function of the MOTU DAC and ADC as a reference
and analyse the difference to it. In other words, we normalize the linear responses of
the HiFiBerry to the linear response of the MOTU. Given two measurements Y1 and
Y2 , the difference transfer function is obtained by
H_{dif} = \frac{Y_1}{Y_2} = \frac{H_{mt} H_{adc} X}{H_{rp} H_{adc} X} = \frac{H_{mt}}{H_{rp}}    (50)

where H_{adc} is the transfer function of the ADC, and H_{mt} and H_{rp} are the transfer functions of the two DACs.
Figure 33: Linear frequency responses of the MOTU DAC and the two interpolation
filters of the HiFiBerry DAC.
Figure 34: Linear frequency responses of the two HiFiBerry DACs normalized to the
frequency response of the MOTU DAC.
Figure 34 visualizes the normalized frequency responses of the HiFiBerry DAC with the FIR and the low-latency IIR interpolation filter settings. The graph does not show any noticeable changes in the linear responses. The magnitudes of the filters are higher than zero at low frequencies, and both responses decrease in the high frequency roll-off region as expected. The high magnitude fluctuation near the Nyquist frequency occurs because the magnitudes decrease below the noise threshold.
6.3
Harmonic distortion
The logarithmic sweep method was also used to measure harmonic distortion responses.
We analysed the harmonic distortion responses by windowing the response of each
distortion order from the measured impulse response. The beginning and end points
of the windows were assigned using Eq. (46).
Figure 35: Harmonic distortion responses of the MOTU and HiFiBerry DAC.
Figure 36: Harmonic distortion responses of a) the MOTU and b) the HiFiBerry
DAC+ in percentages.
Figure 35 shows the 2nd, 3rd, 4th and 5th order harmonic distortion responses of
both the MOTU and HiFiBerry DAC. The harmonic distortion of the MOTU DAC
is slightly lower in each graph. The 3rd and 5th distortion orders of the HiFiBerry
DAC show a significant increase in the 2 kHz–20 kHz region, while the harmonic
distortion of the MOTU DAC remains constant in the whole frequency band.
Harmonic distortion is typically also represented as a percentage of the linear
response. Figure 36 shows the harmonic distortion of the HiFiBerry DAC+ and the
MOTU audio interface as a percentage of the linear response. The 3rd and 5th order
distortion components of the HiFiBerry DAC+ show a clear linear increase starting
from 1.5 kHz, which might imply that odd-order distortion components are generally
dominant in the higher frequency region.
Figure 37: IMD components of a) the MOTU and b) the HiFiBerry DAC.
Table 6: IMD of the MOTU audio interface and the HiFiBerry DAC

MOTU              IMD (SMPTE) -94 dB / 0.002%, 60 Hz/7 kHz, 4:1, 0 dBFS
HiFiBerry DAC+    IMD (SMPTE) -74 dB / 0.02%, 60 Hz/7 kHz, 4:1, 0 dBFS

6.4
Intermodulation distortion
IMD was measured according to the standard SMPTE RP120-1994 by the Society
of Motion Picture and Television Engineers [49]. The standard specifies two test
tones of 60 Hz and 7 kHz combined in a 12 dB (4:1) amplitude ratio. The overall
amplitude of the test signal was normalized to full scale (0 dBFS). This test signal
is sent to the system and recorded at the output. We then examine the modulation
components of the 7 kHz tone caused by the 60 Hz tone.
Figure 37 shows the IMD components of the MOTU and HiFiBerry DAC. The
components are spaced at 60 Hz intervals in the sidebands of the 7 kHz tone. It can
be seen that the components measured from the HiFiBerry DAC are slightly higher than the components measured from the MOTU. Inserting the magnitude values of the first four IMD components into Eq. (42), we obtain the 5th order IMD for the
MOTU audio interface and the HiFiBerry DAC. The results are shown in Table 6.
The results show that the IMD of the HiFiBerry DAC+ is 0.02% which is about ten
times as high as the IMD of the MOTU audio interface.
6.5
Latency
Latency is a key attribute of a real-time DSP system, as it specifies the response time between the system's input and output. The latency-inducing components in the
implemented system are the network, the audio buffer sizes in Pure Data, the FFT
block size used in the convolution and the DAC latency. These latencies are briefly
described in Section 6.5.1.
6.5.1
Latency-inducing system components
The Raspberry Pi 2 includes a 100 Mbit/s network interface which enables a theoretical minimum latency of 0.12 ms for a 1500-byte (the maximum size for a UDP
datagram in OS X) network packet. In practice however, the network speed also
depends on the protocol stack, the network switch and possible other traffic in the
network. The network speed can be easily measured, for example, by using the Unix
ping utility.
Pure Data specifies two kinds of buffers, a block size and an audio buffer size.
The block size defines the length of the sample buffer used inside Pure Data. The
audio buffer size specifies the length of the input and the output sample buffer of
the audio interface. According to the Pure Data documentation, these buffer sizes
are linked in a way that the audio buffer size in milliseconds is rounded down to a
power of two block sizes [50]. The minimum possible block size in Pure Data is 64
samples. However, we examine whether this is true in Section 6.5.3.
In order to accommodate long impulse responses, the block size used in the real-time convolution algorithm must be several times larger than the default block size. As discussed in Section 2.3.3, the computational savings of fast convolution
only start to pay off for blocks larger than 128.
The DAC latency originates from the delay caused by the sample interpolation filter. The HiFiBerry DAC+ uses a Texas Instruments 5122A chip that includes a number of optional interpolation filters. One of the filters, referred to as the low-latency IIR filter, is specifically designed for low-latency applications with only 3.5 samples of latency [48]. The latencies of the MOTU ADC and DAC are not specified in the MOTU UltraLite mk3 user's manual.
6.5.2
Network latency
The system latency was measured using the set-up described in Section 6.1. The test signal is an impulse train with a frequency of 2 Hz, i.e. the signal consists of unit impulses at 0.5 s intervals. The host Pure Data patch simultaneously reads the test signal into the audio output buffer and writes the input signal to a file. Additionally,
we measured the network latency, and the I/O latency of the host computer and the
MOTU Ultralite audio interface.
The network latency between the Macbook and one of the Raspberry Pis was
measured using the Unix ping command. The packet size was set to 1500 bytes as it is the size used by the bcast~ object. Using ping, we measured 100 round-trip
latencies at one second intervals. The one way latency is obtained by simply dividing
the measured values by two. The one-way network latency is visualized in Fig. 38. The latency fluctuates between measurements, but the average network latency is calculated to be 0.718 ms, roughly corresponding to 32 samples at a 44.1 kHz sampling rate.

Figure 38: Network latency as a function of 100 measurements.
6.5.3
Latencies caused by buffering
To ensure that the Pure Data objects responsible for file I/O (readsf~ and writesf~) read and store the same number of samples at each DSP iteration, the patch was first tested in a local loopback by simultaneously reading the test signal and writing it to another file. The written file was identical to the test signal, proving that there are no integral delays between the read and write processes.
The latency was measured with different buffer length combinations. The following four buffers contribute to the overall latency.
1. The audio buffer size of the host Pure Data
2. The block size of the host patch
3. The audio buffer size of the receiving Pure Data
4. The block size of the receiving patch
We denote the host and receiver audio buffer sizes Sh and Sr , and the block sizes
Bh and Br , respectively. Figures 39a, 39b, 40a and 40b show the system latency in
number of samples for an impulse train of 1000 unit impulses. Unexpectedly, the latency is not constant over time. Instead, it decreases linearly, resetting at roughly constant intervals and resulting in a sawtooth-shaped pattern. The amplitude of the fluctuation is equal to Br in all measurements. The sawtooth-shaped decrease in latency might originate from misaligned sample buffers somewhere in the receiving Pure Data patch.
The first two datasets were measured using a receiver patch containing one
bcreceive~ and dac~ object. The third dataset was measured with the normal
receiver patch containing the conv~ object in a subpatch that runs at 1024 block
size, but the bcreceive~ object is at the main patch with block size 64. In the last
measurement, we used a similar patch as in the first two measurements, except that this time the bcreceive~ object is inside a subpatch with a block size of 128.

Figure 39: a) System latency when Bh = 1024, Br = 64, Sh = 5 ms (blue), Sr = 8 ms (blue), Sh = 10 ms (orange) and Sr = 10 ms (orange); b) Sh = 5 ms, Sr = 8 ms, Br = 64, Bh = 64 (blue) and Bh = 256 (orange).

Figure 40: System latency when a) Bh = 64 and Br = 1024, b) Bh = 1024 and Br = 128.
Figure 39a shows two measurements. The blue graph is measured using the
audio buffer sizes Sh = 5 ms and Sr = 8 ms whereas the orange graph is measured
using Sh = 10 ms and Sr = 10 ms. It can be seen that the offset is only approximately one block length, which corresponds to the 2 ms increase in Sr. This leads to the assumption that Pure Data calculates the audio buffer size according to
the equation

S = 64 \left\lfloor \frac{t f_s}{64} \right\rfloor    (51)

where t is the audio buffer length in seconds and fs is the sampling rate.

Table 7: Theoretical latencies of the different system components.

Component                 Samples    Time
Network                   32         0.73 ms
Block size MAC            64         1.45 ms
Block size RPi2           1024       23.22 ms
PD audio buffer MAC       192        4.35 ms
PD audio buffer RPi2      320        7.26 ms
HiFiBerry DAC+            4          0.09 ms
Total                     1572       35.65 ms
Thus, we can assume that at 44100 Hz sampling rate 8 ms ≈ 320 samples and 10 ms
≈ 384 samples. Furthermore, this would indicate that the audio buffer size does not
affect the input latency in the host Pure Data. The output latency can be neglected
as the audio is transmitted via network instead of audio outputs.
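The assumed buffer sizes are easy to verify against Eq. (51); the following snippet, with an illustrative helper name, reproduces the 320- and 384-sample figures.

#include <stdio.h>

/* Eq. (51): the audio buffer in samples, rounded down to a whole number of
 * 64-sample blocks. t_ms is the buffer length in milliseconds. */
static int pd_audio_buffer_samples(double t_ms, double fs)
{
    return 64 * (int)(t_ms / 1000.0 * fs / 64.0);
}

int main(void)
{
    printf("%d %d\n", pd_audio_buffer_samples(8.0, 44100.0),
                      pd_audio_buffer_samples(10.0, 44100.0));   /* 320 384 */
    return 0;
}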
In Fig. 39b, the block size of the host patch is increased from 64 (blue graph) to 256 (orange graph). As expected, the figure shows a constant increase in latency, corresponding to the increase in the block size.
Figure 40a shows the latency when Bh = 64 and the convolution external is
running with a block size of 1024, i.e. Br = 1024. This is the typical use case of the
convolution system as the bcast~ external can transmit blocks of size 64 while the
conv~ external is already able to process over 1 s long impulse responses with a 1024
block size. The audio buffer size of the receiving Pure Data patch is 8 ms.
Figure 40b is an interesting case where Bh = 1024 and Br = 128. The figure
indicates that processing bcreceive~ in a larger block size results in higher latency
fluctuations. As mentioned earlier, the data in Fig. 40a was measured with the
bcreceive~ object within the main patch that uses a block size of 64. This suggests that the latency fluctuation may indeed originate from the bcreceive~ object.
6.5.4
Overall latency
Lastly, we summarize how the individual latencies of the system components add up
to the overall latency. We also examine how well the specified latencies correspond
with the measured results. Table 7 lists the main latency inducing components of
the DSP system in the case where the convolution is processed using 1024 block
size. The sample values are converted to milliseconds with fs = 44100 Hz. The Pure
Data audio buffer lengths are calculated according to Eq. (51). Comparing the total
latency to the graph shown in Fig. 40a, it can be observed that the graph starts at roughly the same value of 1572 samples, which corresponds to approximately 36 ms. However, the fluctuation of the data makes it unclear whether other factors, e.g. the variation of the network latency, cause additional changes in latency.

Figure 41: The relative CPU time used by system and user processes. The graph shows the 15 most time consuming process symbols.
The linear decrease of the latency is an interesting feature. For now, we settle
for the interpretation that the buffering in the network externals causes the shift in latency, and leave further speculation for future research and development.
6.6
Code profiling
Latency is closely linked to computational complexity. If strict latency requirements are set, the signal processing tasks must be modified so that the worst-case computation time does not exceed the given latency specification. On the other hand, if the latency specification is flexible, the audio can be processed in larger buffers, which reduces the processing effort per sample. For example, the FFT is relatively faster to compute on longer frame sizes.
Therefore, it is worthwhile to use code profiling tools that report how the computational load is distributed not only among processes within the system, but also
among functions within programs and code lines within functions. In this thesis, we
used Oprofile [51], an open source system profiler for Linux, to obtain data on how
the CPU time is distributed while the convolution software runs.
We collected CPU usage data from user and kernel processes by running operf with the system-wide option for 2 minutes. At the same time, the Pure Data patch was running, convolving two channels with 1 s long impulse responses. Figure 41 shows the fifteen tasks that consume the most CPU time. The graph
shows that the convolution algorithm conv_tilde_perform and its child thread
conv_tilde_parallel_thread together consume nearly 40 % of the total CPU
time. The symbol /snd_pcm represents the audio driver consuming roughly 10 %
of the CPU time. The symbol /usr/bin/puredata denotes Pure Data which uses
approximately 7.5 % of the overall CPU time. The rest of the CPU time is used by the FFTW and standard C library routines, as well as numerous kernel processes, which make up the remaining bars in Fig. 41.
Unfortunately, the operf tool does not report the actual CPU cycles spent in each symbol. Instead, it collects samples from the system that indicate which function the CPU is currently working on. However, Oprofile can use the sample data to generate an annotation that shows the relative CPU time spent in each line of code within a source file. An annotation was generated for the source file conv~.c. According to the annotation, roughly 98 % of the CPU time is spent computing the complex multiplications required by the overlap-add algorithm. This is therefore the place where code optimization would improve the algorithm performance the most.
7
Summary
This section summarizes the thesis, recapitulating the implementation choices and the
measurement results. Based on the results, we draw conclusions about the usability
and effectiveness of the DSP system and propose a variety of applications where the
system would prove useful. Lastly, the section discusses ideas for future research and
further development of the DSP software.
7.1
Conclusions
This thesis implemented a distributed real-time convolution system using Raspberry
Pi single board computers. The system transmits audio and parameter data using
Pure Data and audio over IP. We have discussed programming for Pure Data, optimal
real-time convolution algorithm design and code optimization for the ARM processor.
Furthermore, we have measured the latency and audio quality attributes of the
system. The source code is open-source under the GNU GPLv3 license. The code is
available online at https://github.com/joniemi/dist-conv.
The thesis utilized the Raspberry Pi 2 Model B due to its good cost-performance ratio, widespread community, ease of use and the support for a low-cost DAC add-on board. The Raspberry Pi 2 employs the general purpose ARM Cortex-A7 processor. Even though the processor is not specifically designed for DSP, it offers supplementary features, such as a fused multiply-accumulate instruction and the NEON™ extension, that accelerate DSP tasks. Multiple processor cores can be exploited to increase the overall algorithm performance, even though the current multithreading implementation is inefficient due to the computational overhead caused by the threading routines.
The implementation shows that modern low-cost single board computers are
capable of relatively strenuous DSP tasks in perceptual real-time. However, the
real-time processing capabilities are still infeasible for virtual acoustics applications
that convolve multiple impulse responses often longer than 1 s. With the implemented
system, a Raspberry Pi 2 is able to process two channels each convolved with a
1 s long impulse response with an I/O latency of 36 ms without buffer underflow.
Longer impulse responses or multiple sources can be processed but at the cost of
higher latency. Additionally, periodical latency drift occurs while running the system. The drift is presumably caused by inaccurate buffering in the audio over IP
external objects.
The thesis measured the linear response, the THD and the IMD of the HiFiBerry
DAC+. The results show that the low-cost DAC-board for the Raspberry Pi can
indeed be considered high fidelity, thus enabling the implementation of professional
audio signal processing systems on the Raspberry Pi platform.
7.2
Applications
Despite the insufficient performance for the real-time virtual acoustics application,
the system has potential in other applications that benefit from portable and low-cost
real-time FIR filtering. One typical real-time FIR filtering application would be
a traditional or an adaptive room equalizer (EQ). Traditionally, room EQs are
used in order to compensate for the spectral colouring caused by the room and the
loudspeakers in studio and home environments. One of the problems of traditional
room equalizers is that the room impulse response is measured from a certain point in
the room and the compensation only works at that specific point. Adaptive equalization
methods, on the other hand, constantly analyse the change in the room impulse
response and update the filter coefficients accordingly. A number of implementations
have been proposed in the literature [52, 53, 54, 55]. The system proposed in this
thesis, in conjunction with a measurement microphone, could potentially carry out
the signal analysis and the real-time inverse filtering tasks.
Furthermore, the system can be used to prototype more sophisticated applications,
such as head tracking and adaptive noise cancellation. The Inter-Integrated Circuit (I2C) connectors enable additional sensors and circuit boards
to be attached to the SBCs. By attaching motion sensors to the I2C connectors, the
Raspberry Pi can then carry out the DSP required in interpolating the Head-Related
Transfer Functions (HRTF) based on the sensory data and convolving the audio with
the HRTFs. The Raspberry Pi could also be used as the DSP unit for an adaptive
noise cancellation system by attaching it to the required measurement microphones
and speakers. Due to its portability, the Raspberry Pi fits in tight spaces, for example
inside a ventilation duct.
Lastly, the convolution system can also be used in a musical context as a convolution reverberation effect, within the aforementioned restrictions on the impulse response length. Another potential musical application that requires FIR filtering is cabinet modelling. The impulse response of a speaker cabinet contributes considerably to the sound of electric guitars and other electroacoustic instruments. The system could be used as-is as a cabinet modelling effects unit, or the convolution external can be integrated into electronic music projects programmed in Pure Data.
7.3
Future research and development
A number of features and ideas are left for future development. First, the latency
drift issue should be solved, and the synchronization of the SBCs requires a test
method as the thesis did not measure the synchronization of multiple Raspberry Pis
or the impact of the network switch.
Second, due to the various inflexibilities of Pure Data, including the lack of
command line control and the complicated patch interface, it would be interesting
to implement the system using an alternative audio API, such as JACK, JUCE or
PortAudio. This would require reprogramming the Pure Data external source
code into the format of the chosen API. Additionally, APIs such as JUCE would
offer custom GUI tools and the option to build the application as a Virtual Studio
Technology (VST) plug-in that can be loaded in a digital audio workstation, such as
Ableton Live, Logic or FL Studio.
Third, Windows and Max/MSP support could be included by adding extra code
to the Pure Data externals. Currently, the system is guaranteed to run on OS X and
Linux.
Fourth, as the SBC market is still booming, with new Kickstarter-funded SBCs coming out in the future, it would be interesting to measure the system performance on
alternative SBCs, such as Bela [56] or Pine A64 [57].
Finally, multiple processor cores could be better exploited by implementing a
more optimal multithreading solution, for example using a thread pool.
References
[1] V. Pulkki, J. Merimaa, and T. Lokki. Reproduction of Reverberation with Spatial Impulse Response Rendering. In AES 116th Convention, Berlin, Germany,
May 2004.
[2] S. Tervo et al. Spatial decomposition method for room impulse responses. J.
Audio Eng. Soc, 61(1/2):17–28, January 2013.
[3] S. Whalen. Audio and the Graphics Processing Unit. Author report, University
of California Davis, 47:51, 2005.
[4] J. A. Belloch et al. Real-time massive convolution for audio applications on
GPU. The Journal of Supercomputing, 58(3):449–457, 2011.
[5] Khronos Group, OpenCL overview. https://www.khronos.org/opencl. Accessed Jan. 4, 2016.
[6] NVIDIA Corporation, CUDA overview. https://developer.nvidia.com/cuda-zone. Accessed Jan. 4, 2016.
[7] JACK Audio Connection Kit | Home. http://jackaudio.org/. Accessed Feb. 11, 2016.
[8] JUCE, Discover JUCE. http://www.juce.com/discover. Accessed Jan. 5, 2016.
[9] Puredata, Pd Community site. https://puredata.info/. Accessed Jan. 5, 2016.
[10] S. M. Kuo, B. H. Lee, and W. Tian. Real-Time Digital Signal Processing:
Fundamentals, Implementations and Applications. John Wiley & Sons, Somerset,
NJ, USA, 3rd edition, 2013. ISBN: 9781118414323.
[11] S. K. Mitra. Digital Signal Processing: A Computer-Based Approach. McGraw-Hill, Boston, 2nd edition, 2002. ISBN 0-07-252261-5.
[12] C. E. Shannon. Communication in the Presence of Noise. Proceedings of the
IRE, 37(1):10–21, 1949.
[13] C. S. Burrus and T. W. Parks. DFT/FFT and Convolution Algorithms: Theory
and Implementation. Wiley, New York, 1985. ISBN: 0-471-81932-8.
[14] J. W. Cooley and J. W. Tukey. An Algorithm for the Machine Calculation of
Complex Fourier Series. Mathematics of Computation, 19(90):297–301, 1965.
[15] E. C. Ifeachor and B. W. Jervis. Digital Signal Processing: A Practical
Approach. Addison-Wesley, Wokingham, 1993. ISBN 0-201-54413-X.
[16] J. L. Harrington. Ethernet Networking for the Small Office and Professional
Home Office. Morgan Kaufmann Publishers-Elsevier, 30 Corporate Drive, Suite
400, Burlington, MA 01803, US, 2007. ISBN-10 0-12-373744-3.
[17] M. M. Alani. Guide to OSI and TCP/IP Models. Springer Briefs in Computer
Science. Springer, 2014. ISBN: 978-3-319-05151-2.
[18] ISO/IEC 7498-1. Information technology - Open Systems Interconnection Basic Reference Model: The basic model, 1994.
[19] D. Reynders and E. Wright. Practical TCP/IP and Ethernet networking.
Newnes-Elsevier, Linacre House, Jordan Hill, Oxford OX2 8DP, United Kingdom,
2003. ISBN: 07506 58061.
[20] J. Postel. RFC 793. Transmission Control Protocol, 1981.
[21] J. Postel. RFC 768. User Datagram Protocol, 1980.
[22] RFC 791. Internet Protocol, 1981.
[23] Rekhter et al. RFC 1918. Address Allocation for Private Internets, February
1996.
[24] J. Mogul. RFC 919. Broadcasting Internet Datagrams, October 1984.
[25] Bouillot et al. Best Practices in Network Audio. J. Audio Eng. Soc, 57(9):729–
741, September 2009. http://www.aes.org/e-lib/browse.cfm?elib=14839.
[26] Specification of The Digital Audio Interface (The AES/EBU interface), 2004.
https://tech.ebu.ch/docs/tech/tech3250.pdf. Accessed April 13, 2016.
[27] C. Ortmeyer. Then and Now: A Brief History of Single Board Computers, 2014. http://www.element14.com/community/servlet/JiveServlet/previewBody/68547-102-2-296024/A%20Brief%20History%20of%20Single%20Board%20Computers.pdf. Accessed Jan. 21, 2016.
[28] R. Roy and V. Bommakanti. ODROID-C1 User Manual. Hard Kernel, Ltd.,
704 Anyang K-Center, Gwanyang, Dongan, Anyang, Gyeonggi, South Korea,
2015.
[29] ARM Limited, 10 Fulbourn Road, Cambridge, CB1 9NJ, England. ARM Cortex-A Series Programmer's Guide, 2013. Version 4.0.
[30] C. J. Hughes. Single-Instruction Multiple-Data Execution. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2015. ISBN: 978162705764.
[31] BeagleBone Black wiki, BeagleBone Black Description. http://elinux.org/Beagleboard:BeagleBoneBlack#BeagleBone_Black_Description. Accessed Jan. 21, 2016.
[32] CubieBoard, cubieboard Docs. http://docs.cubieboard.org/products/start#cubieboard2. Accessed Jan. 21, 2016.
[33] ODROID | Hardkernel, ODROID-C1+ Technical details. http://www.hardkernel.com/main/products/prdt_info.php?g_code=G143703355573. Accessed Jan. 20, 2016.
[34] Raspberry Pi Foundation, Raspberry Pi 2 model B. https://www.raspberrypi.org/products/raspberry-pi-2-model-b/. Accessed Jan. 21, 2016.
[35] HiFiBerry – High Quality Raspberry Pi Audio. https://www.hifiberry.com/.
Accessed July 6, 2016.
[36] J. Pais and A. Hyde. Pure Data. Floss Manuals, 2012. http://en.flossmanuals.net/pure-data/_booki/pure-data/pure-data.pdf. Accessed Feb. 1, 2016.
[37] J. M. Zmölnig. HOWTO write an External for Pure Data, 2014. Available:
http://pdstatic.iem.at/externals-HOWTO/pd-externals-HOWTO.pdf.
[38] GNU Project - Free Software Foundation, The GNU Compiler Collection.
https://gcc.gnu.org. Accessed March 23, 2016.
[39] M. Frigo and S. G. Johnson. FFTW. Massachusetts Institute of Technology,
2003.
[40] The benchFFT Home Page. http://www.fftw.org/benchfft/. Accessed Feb.
15, 2016.
[41] Remu, netsend . http://www.remu.fr/sound-delta/netsend~/. Accessed
April 4, 2016.
[42] B. P. Douglass. Real-time design patterns: robust scalable architecture for
real-time systems. Addison-Wesley, Boston (MA), 2003. ISBN 0-201-69956-7.
[43] S. Temme. Audio Distortion Measurements. Brüel & Kjær Application Note,
1992. BO 0385-11.
[44] M. Clara. High-Performance D/A-Converters. The Springer Series in Advanced
Microelectronics. Springer, Berlin Heidelberg, 2013. ISBN 978-3-642-31228-1.
[45] G. Stan, J. Embrechts, and D. Archambeau. Comparison of Different Impulse
Response Measurement Techniques. J. Audio Eng. Soc, 50(4):249–262, April
2002.
[46] A. Farina. Simultaneous Measurement of Impulse Response and Distortion with
a Swept-Sine Technique. In AES 108th Convention, Paris, France, February
2000.
77
[47] A. Farina. Advancements in Impulse Response Measurements by Sine Sweeps.
In AES 122th Convention, Vienna, Austria, May 2007.
[48] Texas Instruments, PO Box 655303, Dallas, Texas 75265. PCM512x 2-VRMS
DirectPathTM , 112/106-dB Audio Stereo DAC With 32-bit, 384-kHz PCM Interface (Rev. B), January 2016.
[49] SMPTE RP 120-1994. Measurement of Intermodulation Distortion in MotionPicture Audio Systems, January 1994.
[50] Pd Community Site, Pd Documentation chapter 3: Getting Pd to run. https:
//puredata.info/docs/manuals/pd/x3.htm.
[51] Oprofile – A System Profiler for Linux. http://oprofile.sourceforge.net/
news/. Accessed July 7, 2016.
[52] P. G. Craven and M. A. Gerzon. Practical Adaptive Room and Loudspeaker
Equaliser for Hi-Fi Use. In AES UK 7th Conference: Digital Signal Processing
(DSP), London, UK, September 1992.
[53] J. Leitão, G. Fernandes, and A. J. S. Ferreira. Adaptive Room Equalization in
the Frequency Domain. In AES 116th Convention, Berlin, Germany, May 2004.
[54] A. Leite and A. J. S. Ferreira. An Improved Adaptive Room Equalization in
the Frequency Domain. In AES 118th Convention, Barcelona, Spain, May 2005.
[55] A. Rocha et al. Adaptive Audio Equalization of Rooms based on a Technique
of Transparent Insertion of Acoustic Probe Signals. In AES 120th Convention,
Paris, France, May 2006.
[56] Bela, The Embedded Platform for Ultra-Low Latency Audio and Sensor Processing. http://bela.io/. Accessed July 6, 2016.
[57] Pine A64, First $15 64-Bit Single Board Super Computer.
Kickstarter.
https://www.kickstarter.com/projects/pine64/
pine-a64-first-15-64-bit-single-board-super-comput.
Accessed
July 6, 2016.
78
A Build instructions
This Appendix presents the makefiles that build the Pure Data external objects, together with instructions for installing the tools and libraries used in this thesis.
A.1 Makefile for bcast and bcreceive
# ----------------------- MAC OS X -----------------------
pd_darwin: bcreceive~.pd_darwin bcast~.pd_darwin

.SUFFIXES: .pd_darwin

DARWINCFLAGS = -DPD -DUNIX -DMACOSX -O2 \
    -Wall -W \
    -Wno-unused -Wno-parentheses -Wno-switch

#DARWININCLUDE = -I../../src -Iinclude

.c.pd_darwin:
	cc $(DARWINCFLAGS) $(DARWININCLUDE) -o $*.o -c $*.c
	cc -bundle -undefined suppress -flat_namespace \
	    -o $*.pd_darwin $*.o
	rm -f $*.o ../$*.pd_darwin
	ln -s $*/$*.pd_darwin ..

# ----------------------- LINUX ARMv7 -----------------------
pd_linux: bcreceive~.pd_linux bcast~.pd_linux

.SUFFIXES: .pd_linux

LINUXCFLAGS = -DPD -DUNIX -DHAVE_LRINT -DHAVE_LRINTF -O2 \
    -funroll-loops -fPIC -fomit-frame-pointer \
    -march=armv7ve -mtune=cortex-a7 -ftree-vectorize \
    -mfpu=neon -mfloat-abi=hard \
    -Wall -W -Wshadow -Wstrict-prototypes \
    -Wno-unused -Wno-parentheses -Wno-switch

LINUXINCLUDE = -I/usr/include/pd

.c.pd_linux:
	arm-linux-gnueabihf-gcc $(LINUXCFLAGS) $(LINUXINCLUDE) \
	    -o $*.o -c $*.c
	ld -export_dynamic -shared -o $*.pd_linux $*.o -lc -lm
	rm $*.o
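Assuming the makefile above is saved as Makefile in the same directory as the bcast~.c and bcreceive~.c sources (the file name is an assumption, not fixed by the thesis), the externals are built on the Raspberry Pi, or on any host providing the arm-linux-gnueabihf-gcc toolchain, for example with

    make pd_linux

The resulting bcast~.pd_linux and bcreceive~.pd_linux binaries are then copied to a directory on the Pure Data search path.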
A.2 Makefile for conv
# ----------------------- LINUX ARMv7 -----------------------
pd_linux: conv~.pd_linux

.SUFFIXES: .pd_linux

LINUXCFLAGS = -DPD -DUNIX -DHAVE_LRINT -DHAVE_LRINTF -DINPLACE \
    -DTHREADS -O2 \
    -funroll-loops -fPIC -fomit-frame-pointer \
    -march=armv7ve -mtune=cortex-a7 -ftree-vectorize \
    -funsafe-math-optimizations -mfpu=neon -mfloat-abi=hard \
    -Wall -W -Wshadow -Wstrict-prototypes \
    -Wno-unused -Wno-parentheses -Wno-switch -Wextra

LINUXINCLUDE = -I/usr/include/pd

.c.pd_linux:
	arm-linux-gnueabihf-gcc $(LINUXCFLAGS) $(LINUXINCLUDE) \
	    -o $*.o -c $*.c
	ld -export_dynamic -shared -o $*.pd_linux $*.o -lm -lc \
	    -lfftw3f
	rm $*.o
A.3 benchfftw
./configure --enable-single --build=arm-gnueabihf \
    CC="arm-linux-gnueabihf-gcc" \
    CFLAGS="-march=armv7ve -mtune=cortex-a7 -mfpu=neon \
    -mfloat-abi=hard -ftree-vectorize -funsafe-math-optimizations"
make
sudo make install
A.4 fftw3
./configure --with-slow-timer --enable-single --enable-shared \
--enable-neon --build=arm-gnueabihf CC="arm-linux-gnueabihf-gcc \
-march=armv7ve -mtune=cortex-a7 -mfpu=neon -mfloat-abi=hard \
-ftree-vectorize -funsafe-math-optimizations"
make
sudo make install
A.5 Oprofile
./configure
make
sudo make install
Note that building Oprofile requires the packages binutils-dev and libiberty-dev as well as the popt library. No special configure flags are needed.
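On Raspbian, these build dependencies can likely be installed with apt-get; the package name for the popt development files is assumed here to be libpopt-dev:

    sudo apt-get install binutils-dev libiberty-dev libpopt-dev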
B Raspbian configuration
This Appendix discusses how to configure the Raspbian operating system to recognize the HiFiBerry DAC+, use a static fallback IP address profile, overclock the CPU, adjust the CPU frequency scaling settings and enable real-time scheduling priority for Pure Data.
B.1 HiFiBerry DAC+
The HiFiBerry DAC+ add-on board is not recognized by Raspbian until two configuration steps are carried out. First, insert the line dtoverlay=hifiberry-dacplus into the system configuration file /boot/config.txt; the on-board sound device can be disabled by removing or commenting out the line dtparam=audio=on.
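As a sketch, the relevant lines of /boot/config.txt would then read as follows, with the rest of the file left untouched:

    dtoverlay=hifiberry-dacplus
    #dtparam=audio=on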
Second, edit or create the configuration file /etc/asound.conf with the following contents.
pcm.!default {
    type hw card 0
}
ctl.!default {
    type hw card 0
}
After rebooting the Raspberry Pi, the HiFiBerry DAC+ should appear in the list of playback devices printed by aplay -l.
B.2 Setting up a static IP address
In order to keep track of the IP addresses of the Raspberry Pis, it is useful to assign each board its own static IP address. At the same time, it is convenient to let a Raspberry Pi lease a dynamic IP address when it is connected to a network with a DHCP server, for example an Internet connection. Both goals are met by configuring a static IP address profile for the Ethernet interface: Raspbian automatically falls back to this profile if the dynamic host configuration protocol (DHCP) fails, i.e. when no DHCP server is available. The static profile is created by adding the following lines to the configuration file /etc/dhcpcd.conf.
profile static_eth0
static ip_address=192.168.1.11/24
static routers=192.168.1.1
static domain_name_servers=192.168.1.1

interface eth0
fallback static_eth0
If the network is 192.168.1.0, the host address can be anything between 192.168.1.2 and 192.168.1.254, since 192.168.1.1 is used by the router in this example. The /24 suffix denotes the 24-bit network mask, i.e. 255.255.255.0.
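As a quick sanity check, after rebooting on a network without a DHCP server, the fallback address can be verified for example with

    ip addr show eth0

which should list 192.168.1.11/24 on the eth0 interface.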
B.3 Overclocking
The simplest way to overclock the Raspberry Pi is to use the raspi-config tool. The tool is run from the command line and presents a menu that includes the option “Overclock”. The submenu offers two overclocking presets, “None” and “High”. “None” is the default setting, which sets the maximum CPU clock speed to 900 MHz and the SDRAM clock speed to 450 MHz. The “High” setting raises the CPU frequency to 1000 MHz and the SDRAM frequency to 500 MHz.
Using the preset is safe, as it has been officially tested to be stable. Users can also experiment with other overclocking settings by modifying the system configuration file /boot/config.txt. For example, settings similar to the “High” preset can be achieved by adding the following lines to the configuration file.
arm_freq=1000
sdram_freq=500
over_voltage=2
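As a sketch of a quick check using the Raspberry Pi firmware tools, the current ARM core clock can be queried after a reboot with

    vcgencmd measure_clock arm

and the reported value should approach 1000000000 Hz under full load when the overclock is active.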
B.4 CPU frequency scaling
Some speed-up can be achieved by changing the CPU frequency scaling governor from “ondemand” to “performance”. The setting can be changed at runtime by editing the file /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor; this must be done separately for each core, as shown in the sketch below.
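A minimal shell sketch, assuming the standard cpufreq sysfs layout and root privileges, that switches every core to the “performance” governor:

    # set the "performance" governor on all cores (run as root)
    for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance > "$gov"
    done

The change does not persist across reboots unless it is applied again, for example from a startup script.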
The “ondemand” governor reduces power consumption and heat by lowering the CPU frequency at low system load and allowing the maximum clock speed only when a certain load threshold is exceeded. The “performance” governor keeps the maximum clock speed enabled at all times. It also prevents the performance dips that occur during load spikes when the “ondemand” governor reacts too slowly.
B.5 Enabling the real-time scheduler
In order to enable real-time scheduling priority for Pure Data, users can edit the configuration file /etc/security/limits.conf. The following lines should be added, where <username> is the user account that runs Pure Data.

<username>    -    rtprio     90
<username>    -    memlock    unlimited
However, it is unclear whether simply running Pure Data with the real-time flag -rt serves the same purpose.
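After logging out and back in, a quick way to check that the new limits are in effect is to inspect the shell's resource limits; the option names below are those of bash's ulimit built-in:

    ulimit -r   # maximum real-time priority, should print 90
    ulimit -l   # maximum locked memory, should print unlimited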