DUNCALF Thomas Edward
Tom Duncalf
Final Year Project: “An e-learning system for playing drums”
Acknowledgements
Firstly, many thanks to my supervisor, Dr. Kia Ng, for his ideas, feedback and most of all his patience
in assisting me with this project, and to Derek Magee for initially advising me on the suitability of
such a project.
Thanks also to my housemates for keeping me going at 6 a.m. and for giving me some feedback.
Table of Contents
Acknowledgements
Table of Contents
1 Introduction
  1.1 Problem definition
  1.2 Project aims
  1.3 Minimum requirements and possible extensions
  1.4 Schedule
  1.5 Report Outline
2 Background Research
  2.1 Definitions
  2.2 Programming considerations
    2.2.1 Candidate Programming Languages
    2.2.2 APIs for audio and graphics
  2.3 Drums
  2.4 Capturing Digital Audio
  2.5 Onset detection
  2.6 Drum separation/classification
  2.7 E-learning systems
3 Design and Implementation
  3.1 Methodology
  3.2 Software Architecture
  3.3 User Interface
  3.4 Audio input
  3.5 Onset detection
  3.6 Drum classification
  3.7 Timing analysis
  3.8 Audio Player
4 Evaluation
  4.1 Evaluation methodology
  4.2 Onset detection accuracy
  4.3 Classification accuracy
  4.4 General observations
5 Conclusions
  5.1 Satisfaction of requirements
  5.2 Success of the project
  5.3 Possible improvements to accuracy
  5.4 Recommendations for further development and research
6 Bibliography
Appendix A: Reflection
Appendix B: Informal feedback
Appendix C: The pattern file format
Appendix D: Evaluation notes
1 Introduction
1.1 Problem definition
Music lessons are a popular pastime, particularly amongst school pupils (according to [16], 8% of pupils of compulsory school age received regular tuition in 2002), and drums are one of the most popular instruments to learn, alongside more traditional instruments such as the violin and flute, and instruments popular in contemporary music such as the guitar and keyboard [16]. Typically, individual music lessons are a weekly event, with the time between lessons intended to be used for practising. However, speaking from experience, practice is often seen as something of a hassle, and is not performed as regularly as it could be.
Without proper practice, learning a new instrument can be frustratingly slow, particularly when
feedback is only received once a week. Computer software to enable interactive practice outside of
lessons, allowing pupils to constantly receive some feedback on their playing and to progress to more
advanced practice pieces as appropriate, could clearly have a beneficial effect upon their progress by
making practice both more useful and more enjoyable. Tuition systems which take advantage of the
MIDI capability of modern electronic keyboards to allow the input to be easily captured and analysed
exist for instruments such as the piano (for example, Voyetra’s “Teach Me Piano” or Musicalis’
“Interactive Piano Course”). However, it would seem that no equivalent software exists for the drums
(the majority of drum tuition courses are on DVD). MIDI drum kits are not at all commonplace, and
so a suitable piece of interactive e-learning software for drum players should be able to use an audio
input in place of a MIDI input.
1.2 Project aims
The stated aim of the project is to design and develop a piece of interactive e-learning software which provides training for drummers to play a set of pre-defined drum patterns. The system will track and analyse the user's playing, and provide feedback to them.
The stated objectives are to create a piece of software that will:
• capture a drum beat as an input, and extract from this the timing of individual hits
• extract timing information from a pre-defined drum pattern
• use an algorithm to analyse timing differences between the two inputs
• use the output of this algorithm, combined with a suitable graphical user interface, to provide real time feedback to the player regarding the timing accuracy of their playing
• provide a graphical representation of how the user's playing compared to the ideal playing of that pattern.
1.3 Minimum requirements and possible extensions
The minimum requirements of the project are:
• The software must have a GUI which allows the user to control the software, and to visualise how well their playing matched the pre-defined pattern.
• The software must be able to play back both the pre-defined pattern and the pattern as played by the user.
• The software must use beat tracking techniques to extract the rhythmical pattern played by the user.
• The software must use suitable analysis techniques to analyse the accuracy of the user's playing, and provide useful feedback based upon this.
The possible extensions to these requirements are:
• To analyse the input as captured audio, and extract from this the individual drum hits, i.e. separate out bass drums, snares and hi-hats, using real-time filtering.
• Functionality to detect when a user is “struggling” with a certain aspect of their playing, and to provide relevant exercises to practise this (e.g. a simple timing exercise).
• Keep a record of the user's performance over time, and use this information to provide a progress report.
1.4 Schedule
The original schedule for the project from the mid-project report is presented below.
20/10/2005  Complete preliminary investigation
1/12/2005   Complete preliminary background reading
16/12/2005  Submit mid-project report
23/1/2006   Complete development of preliminary software architecture
1/2/2006    Complete any necessary modifications to plan based upon return of mid-project report
3/3/2006    Complete TOC and draft chapter for checking over with supervisor
10/3/2006   Submit TOC and draft chapter, software demo for progress meeting near completion
17/3/2006   Progress meeting
20/4/2006   Complete software
25/4/2006   Complete project report
2/5/2006    Submit project report
The schedule was revised, following advice received on the return of the mid-project report, to contain more specific milestones relating to the content of the project. However, the schedule was not rigidly adhered to throughout the project, due to certain aspects of the solution taking a good deal more time than anticipated, resulting in some work not being completed by its deadline. For more comments on the time management of the project, see Appendix A.
1.5 Report Outline
The project report is divided into four main sections:
• Background research – a summary of background research carried out in the relevant areas
• Design and implementation – a description of the process of writing the software and how the functionality of the prototype was achieved
• Evaluation – an evaluation of the performance and accuracy of the prototype software
• Conclusion – a summary of the degree to which the project performed and met the minimum requirements, and discussion of improvements that could have been made to enable the software to perform better
In addition, there are five appendices:
• Reflections – personal reflections on the success of the project and the lessons learned from creating it
• Informal feedback – feedback received from friends regarding the software
• Pattern file format – a description of the file format used for storing drum pattern information
• Evaluation notes – additional notes concerning details of the evaluation process
• Revised schedule – the revised schedule for the project
2 Background Research
In order to produce a satisfactory solution to the problem, background research needed to be carried
out in a number of areas. The major one was that of detecting and classifying the drum beats played,
an area in which there already exists a good body of research – however, it was important to take in to
account the limitations of the proposed system (for example, the assumption that the captured input of
the drums will be fairly free of background noise, and of reasonably good, consistent quality) and the
need for real time interaction. In addition to this, it was necessary to decide upon the best
programming language to develop the software in, and to consider the design of the user interface and
the ways in which a user can interact with the system to increase its usability.
2.1 Definitions
In order to aid understanding of this report, certain terms used are defined below:
• Attack: the period of the envelope (see below) of a sound during which the volume is increasing (see Figure 2.3 for an illustrated example).
• API: Application Programming Interface – a library of function calls providing high-level access to lower-level system functionality, for example audio or graphics routines.
• Bar: a segment in time of a musical piece measured by a number of beats. The number of beats in a bar is defined by the time signature – for example, music in a 4/4 time signature has 4 beats to each bar, in which each beat is represented by a quarter note.
• Envelope: the envelope of an audio signal refers to the volume changes in the sound – similar to the representation of a sound viewed in wave editor software. For the sake of simplicity, only Attack-Release envelopes are considered here.
• FFT: Fast Fourier Transform, an algorithm used to determine periodic components of input data – e.g. to determine the frequency spectrum of a sound.
• Latency: in audio software, latency refers to the time between an audio event being received and it being processed, or between a sound being triggered and it being played.
• Onset detection: the splitting of an audio signal into discrete events – in this case, drum hits (the terms “beat detection” and “onset detection” are used interchangeably).
• Release: the period of the envelope of a sound during which the volume is decreasing, after having increased (again, see Figure 2.3 for an illustrated example).
2.2 Programming considerations
2.2.1 Candidate Programming Languages
There were a number of candidate programming languages considered for the implementation of the
software – some of which are used for a wide variety of software, and others specifically designed for
audio processing. The requirements for a suitable language were that it was:
• easy to use – i.e. it offers a sensible structure and syntax, good documentation, and it facilitates Rapid Application Development to some degree (for example, with user interface design tools and inline auto-completion).
• popular (particularly in terms of other projects developed in the language and information available online or in printed form) – using a language which was not widely used could prove problematic in terms of learning the language and understanding errors.
• fast (in terms of execution speed) – in the early stages, the processing requirements of the software could only be very roughly approximated, and due to the real-time nature of the system, a programming language which was slow to execute (for example, an interpreted language) may later be found not to be sufficient.
• flexible – the software needed to contain the audio input and processing functions, the playback and analysis logic, and the graphical user interface. It was decided that it was preferable to avoid using different programming languages for different elements of the functionality of the software, as doing so could lead to difficulties in debugging and interfacing between components, so the chosen language needed to be able to encompass all of this functionality.
Audio specific environments: MAX (Max/MSP, JMax, etc.), PureData and SuperCollider
There exist a variety of programming environments designed for audio (or, more generally, multimedia) processing. MAX (and its descendants, for example Max/MSP and JMax) and PureData are both graphical environments intended primarily for the processing of audio, but also of video and graphics. A “patch” (the equivalent of a program, or an object in a program) is made up of “modules” (including audio input/output modules, data container modules and processing modules) which are interconnected in a graphical fashion (as shown in Figure 2.1). SuperCollider serves a similar purpose without the graphical interface – programs are written in SC, a “Smalltalk derived language with C-like syntax” [8], and the capabilities of the environment can be expanded through the use of a C plugin API to create “unit generators” serving a specific function.
Figure 2.1: A screenshot of a PureData patch to generate a sequence of random notes.
Such an environment provides a rapid platform for creatively developing audio software, due partially
to the easy availability of modules for common tasks such as sound input and output, and offers an
easy way of allowing the software to perform in real time. Both MAX/MSP and
SuperCollider have been used in research projects involving onset detection, for example [10, 19].
However, this work has focussed mostly upon the audio processing aspect of the problem, and development could be restricted, for example by the inability to access APIs which may be useful in other areas such as the user interface, or some features may require interfacing with other programming languages – designing a solution which combined elements from both MAX/PD/SuperCollider and a conventional programming language could be more time consuming than creating the entire software in a “traditional” programming language.
In addition, a lack of familiarity with the environment and its capabilities, and the concern that developing a complex application in a graphical manner would quickly become unwieldy, influenced the decision not to use such an environment.
C++
The C++ language offers many of the desired features of a language for such a solution, and its
suitability for such tasks has been proven over time, with the language having first been drafted in 1980 [25] and the vast majority of audio and gaming software in use today being written in the language
(“C++ is the language of choice among game programmers”, [11]). In particular, it is widely regarded
as one of the fastest programming languages in terms of execution speed, it is immensely popular (as evidenced by the vast number of online communities dedicated to C++ programming) and its flexibility cannot be questioned, thanks to the wide array of APIs designed to work natively with the language (and often coded in it as well) – for example, DirectX and SDL.
C++ was the initial choice of language for the project – however, on exploring the available resources
– for example, the Microsoft DirectX SDK [21] documentation – it was apparent that a lot of the
function calls were not entirely straightforward, and the language's approach to object orientation was
not as pervasive as that of, say, Java. From studying code examples, it can be seen that sometimes
C++ seems to represent the “hard way” of coding a solution to a problem when compared to a
language such as Java, and therefore other, newer languages may provide a better environment in
which to code the software.
Java
Java is a “strongly typed, object-oriented language with a C-style syntax” [38], developed with object
orientation in mind from the ground up, and as such tends to have relatively clean and sensible syntax
conventions, for example in the naming of methods and libraries. The language was first released by
Sun Microsystems in 1995 [29] and as such has been in use for over ten years, and has achieved widespread adoption, particularly in the business sector (thanks to the rapid application development the language allows, and the robustness of running code inside a virtual machine) and the mobile
the language allows, and the robustness of running code inside a virtual machine) and the mobile
sector (virtually every mobile handset released now features Java technology for gaming).
However, multimedia use of Java is less widespread. Projects such as Jake2 (Java Quake 2) by
Bytonic Software [32] and Processing [14], a programming language for interactive graphics based
upon Java, have successfully demonstrated that Java is capable, with the right coding, of good
graphical performance. Audio software written in Java is less widespread however, and in particular
real time audio processing software written in Java is not at all common (notable exceptions include
JSyn, JMax and JAss) - the lack of a mature, well established API for real time audio input and output
is likely a contributing factor towards this.
Java's performance has, in the past, had a bad reputation, particularly because the process of initializing the virtual machine in which the software runs was very slow. Performance is now less of
a concern (although it can be observed that a Java application often does not feel as responsive as a
natively written equivalent), and it may well have been an ideal language to write the software in,
were it not for the lack of well documented audio input and output functions – a large part of the
solution – and the apparent lack of interest in using Java for audio applications.
C#
C# (pronounced “C-sharp”) is a relatively new programming language, the first specification of which
was submitted to the ECMA by Microsoft in 2000 [1]. Despite the name, it has more in common with
Java than the C/C++ family of languages – it is also strongly object oriented and typed, and also runs
inside a virtual machine.
In the case of C#, the virtual machine is known as the Common Language Runtime (CLR), which
“provides services such as JIT compilation, memory management … and integrated security and
permission management” [1], therefore avoiding many of the common pitfalls in C++ such as
memory leaks (handled by the integrated garbage collection) and security vulnerabilities due to errors
such as unchecked buffer overruns. The CLR is part of the Microsoft .NET framework, which is a key
concept in C# (and .NET programming in general) programming. The framework offers both
language and platform independence, and also provides class libraries, known as the Framework
Class Library [1], which facilitates common programming tasks such as string manipulation, file I/O
and user interface rendering.
The availability of these mature and well documented classes, combined with the straightforward,
familiar syntax and reported good execution speed [37] make C# an ideal language for rapid
development of software. As C# is quite a young language, there does not currently exist a great deal of audio software developed in it, but it is being pushed by Microsoft as the future of game programming, and the availability of a “managed” (i.e. designed for the .NET framework) version of DirectX would appear to make it an ideal candidate language, and an interesting study in the performance of the language for such tasks.
Development Environment
Having decided upon C# as the language in which to program the software, it was necessary to decide
upon a development environment to use. Microsoft offer a free “Express” version of their popular
Visual C# 2005 product [22], with very few restrictions compared to the full version, and so this was
the obvious choice for the development environment.
2.2.2 APIs for audio and graphics
The majority of programming languages (with the exception of the audio specific environments
discussed in 2.2.1) do not natively provide a great deal of support for audio input and output and
graphics. C# offers a technology known as GDI+ for graphics presentation, but early testing suggested
that, whilst the software is by no means graphically intensive, the overhead of using GDI+ to update
graphics in real time was unacceptable. Natively, there is very little support for audio input and
playback – the simplest route for providing such functionality is to include the Windows multimedia
API, winmm.dll, but again this is not designed for real time performance.
There exist a number of application programming interfaces (APIs) which offer either graphics or sound functionality, or in some cases both, such as PortAudio for audio [7], SDL for graphics and
audio [36] and OpenGL for graphics [24], but on the Windows platform the most mature and widely
used technology is Microsoft DirectX. DirectX has been included with versions of Windows from
2000 onwards, and provides support for the vast majority of graphics, audio and input hardware,
allowing easier and faster access to their functionality.
For graphics programming, two technologies are provided, DirectDraw and Direct3D (with
DirectDraw providing 2D only functionality). In the Managed DirectX framework (designed
specifically for the .NET framework), the Direct3D API is considerably more advanced than that of
DirectDraw, and so Direct3D offers a better solution for the graphics element of the software, despite the software only utilising 2D graphics.
Two technologies also exist for audio programming – DirectMusic and DirectSound – with
DirectMusic seemingly being phased out in favour of DirectSound, as it is not included in the Managed DirectX APIs. DirectMusic also offers no input functionality [20], and so DirectSound was
chosen as the audio API for the project.
2.3 Drums
A typical drum kit consists of at least a bass (or kick) drum, a snare drum, several tom-toms, a hi-hat,
a ride cymbal and a crash cymbal. For the purposes of simplifying the project, it was decided to take into account three of these drums – the bass drum, the snare drum and the hi-hat, in both the open and closed positions.
The characteristics of the chosen drums are as follows:
• Bass drum – the bass drum is a large, floor-standing drum, struck with a pedal-operated mallet, producing a low pitched sound. The bass drum typically drives the beat of a song.
• Snare drum – the snare drum is usually mounted at knee height, and consists of a metal (or wooden) cylinder with a “skin” stretched over it. The skin is struck with a drum stick to produce a short, sharp sound, often heard on the 2nd and 4th beats of the bar.
• Hi-hat – the hi-hat is mounted on a stand with a pedal at the bottom. The drum itself consists of two metal cymbals, one on top of the other. In the open position, the two cymbals are separated, and striking the drum produces a sustained, metallic sound. The pedal is used to adjust the distance between the two cymbals, and when it is fully pressed, the hi-hat is in the closed position and striking it produces a shorter, more subtle sound.
2.4 Capturing Digital Audio
As mentioned in 1.1, the ideal solution (in terms of ease of programming) to the problem of capturing the user's input would be to capture the input from a MIDI drum kit, but in order to increase the usability of the system and remove the need for specialist MIDI input equipment, it is desirable to be able to capture an audio input and extract the drum pattern played by the user from this.
In order to represent an audio signal digitally, we must first be able to provide a representation of the
sound to the computer. Typically, sound – which is in reality made up of waves of pressure travelling
through the air [33]– is captured by a microphone, in which the sound waves hit a diaphragm and are
converted in to electrical signals [33], before travelling down a cable as an analogue signal and
reaching the input of some recording device. To capture audio digitally, the signal cable connects to
an input of the computer’s soundcard, which contains an analogue-to-digital converter, used to
convert the analogue input in to digital form. In order to do so, the level of the analogue input (which
can be visualised as a continuous wave) must be sampled to a certain degree of precision at regular
intervals, as illustrated below.
Figure 2.2: An illustration of the sampling process. The original audio signal (a sine wave) is sampled at regular intervals (the sampling rate), with the level of each sample recorded as one of a set of discrete levels (the sampling resolution), producing the digital representation.
The number of samples recorded per second is known as the sampling rate, and is typically expressed in Hertz or kilohertz. Typical sampling rates range from 11,025Hz (for low bandwidth signals such as speech), through 44,100Hz (as used for audio CDs – due to the effects of the Nyquist theorem [2], this sampling rate allows for signals up to around 22kHz to be captured, covering the entire estimated human range of hearing [13]), up to 96,000Hz (as used by the newer DVD-Audio format).
The precision to which the input level is recorded at each sampling interval is known as the sampling resolution. Earlier soundcards, such as the original SoundBlaster, used only an 8-bit sampling resolution, providing only 256 possible levels for each sample, but today a 16-bit sampling resolution is most commonly used, and is the CD Audio standard, providing 65,536 possible values for each sample. In addition to these two parameters, an input signal is typically captured either in mono (i.e. 1 channel) or stereo (i.e. 2 independent channels for the left and right signals).
2.5 Onset detection
To a human, the task of detecting beats in a piece of music is trivial. We have the ability to
differentiate between a beat and other elements of the sound, and our natural ability for pattern
matching allows us to follow the rhythm of even a complex beat without actively thinking about it –
for example, idly tapping along to a piece of music [12]. However, it is not so trivial for a computer to
do the same, as what seems obvious to a human (a drum sound in amongst a variety of other
instruments, or the repetitive nature of a drum rhythm) can seem chaotic when represented digitally.
There has been a great deal of research into the area of onset detection (also known as beat detection)
in pieces of music, for example [4, 15, 31], and the current state of the art technology as used in
software such as Native Instruments’ Traktor DJ software allows the computer to accurately detect the
tempo of a piece of music (particularly electronic music, which tends to have a more clearly defined
rhythm, often in the form of quite a simple 4/4 beat) and analyse the phase of the beats of two songs
in order to “beat match” them, as a human DJ does. A large amount of the research is focussed upon
separating the drum beats from the surrounding layers of instrumentation, which is not a great concern
for this project, as the software is designed to only capture the sound of a drum kit – however, it is
still necessary for the beat to be extracted from background noise, which will inevitably be present if
the sound is captured with a microphone.
Advanced approaches use techniques such as isolating each frequency band in which a beat is likely
to occur and analysing these independently [15], but one common approach [4, 5] to detecting the
onsets in the resulting output is to analyse the energy of the signal. The short-time energy of an audio signal at a given time “provides a convenient representation of the amplitude variation over time” [35] (such that a large amplitude variation results in a higher energy value), and can be estimated for a frame of audio k by the equation below [27], where |k| represents the number of samples in the current frame (referred to as the energy sample size), and S(n) represents the n’th sample of k.
Equation 2.1: Short-time energy calculation

E_k = \sum_{n=1}^{|k|} S(n)^2
As noted in [5], sound energy typically increases at the time of an audio event, such as a new note or sound being played; percussive sounds in particular, due to their sharp attacks (as illustrated in Figure 2.3), produce a large change in energy. This would suggest that in order to detect percussive events, it is sufficient to detect these sudden changes in energy. In order to take background noise into account, the average energy over a period of time before the current frame can be determined, in order to calculate an average energy history level. The period of time used for this energy history is variable – too short a history may not sufficiently average out variations in the background signal, resulting in energy peaks being less defined, whilst too long a history may result in extraneous peaks being detected. The details and performance of the implementation of this algorithm can be seen in sections 3.5 and 4.2.
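To make the calculation concrete, the following is a minimal C# sketch of Equation 2.1 combined with the energy history comparison described above; the class, method and variable names here are illustrative assumptions, not those of the actual prototype (the real implementation is described in section 3.5).

using System;
using System.Collections.Generic;

class EnergyOnsetDetector
{
    private readonly Queue<double> energyHistory = new Queue<double>();
    private readonly int historyLength;   // number of past frames to average over
    private readonly double threshold;    // e.g. 1.5, as in Figure 3.6

    public EnergyOnsetDetector(int historyLength, double threshold)
    {
        this.historyLength = historyLength;
        this.threshold = threshold;
    }

    // Equation 2.1: the sum of squared sample values over one frame
    public static double ShortTimeEnergy(float[] frame)
    {
        double energy = 0;
        foreach (float s in frame)
            energy += (double)s * s;
        return energy;
    }

    // Returns true if this frame's energy exceeds the average past energy
    public bool IsEnergyPeak(float[] frame)
    {
        double local = ShortTimeEnergy(frame);
        double average = 0;
        foreach (double e in energyHistory)
            average += e;
        if (energyHistory.Count > 0)
            average /= energyHistory.Count;

        bool peak = energyHistory.Count > 0 && local > average * threshold;

        // Maintain a fixed-length history of past frame energies
        energyHistory.Enqueue(local);
        if (energyHistory.Count > historyLength)
            energyHistory.Dequeue();

        return peak;
    }
}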
Figure 2.3: Two sounds, one a synthesized sound with a long attack and release, the other a snare sound with virtually no attack.
2.6 Drum separation/classification
A separate, and more complex problem in producing a solution which is more usable and requires
little in the way of specialist equipment is that of identifying which drum is being played for a given
detected onset. The ideal scenario would be to have multiple directional microphones, each capturing
the sound of one drum, going in to multiple inputs on the computer. In this case, most percussive
sounds could be identified by performing simple onset detection as described in section 2.5 on each
input (which would filter out background noise, such as the “bleed” from the other drums) - it would
be known in advance which drum each input was assigned to. However, such a solution would be
impractical for several reasons – it would require a minimum of four separate microphones, and a
soundcard with at least four separate inputs (typically only found in professional audio soundcards).
Also, it is possible that processing several input streams simultaneously would be too CPU intensive,
and finally such a solution would offer little innovation in its implementation.
Numerous researchers have approached the problem of separating multiple drum sounds from a single
input with varying degrees of success [15, 34]. One common feature of many of these approaches is
the use of FFT analysis in order to analyse the frequency content of each drum hit. Template FFT
vectors for each drum could therefore be calculated during calibration of the software, by storing the
average of a number of FFT analyses on each drum, and each detected onset could then be classified
by determining which template FFT its spectrum matches most closely.
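As an illustration of this idea, the following C# sketch classifies an onset by the Euclidean distance between its FFT magnitude spectrum and a set of stored templates. The class and method names are hypothetical; the matching actually used in the prototype is described in section 3.6.

using System;
using System.Collections.Generic;

static class TemplateMatcher
{
    // Returns the name of the template whose spectrum is closest
    // (by squared Euclidean distance) to the captured onset's spectrum.
    public static string Classify(float[] onsetSpectrum,
                                  Dictionary<string, float[]> templates)
    {
        string best = null;
        double bestDistance = double.MaxValue;

        foreach (KeyValuePair<string, float[]> t in templates)
        {
            double distance = 0;
            for (int i = 0; i < onsetSpectrum.Length; i++)
            {
                double d = onsetSpectrum[i] - t.Value[i];
                distance += d * d;
            }
            if (distance < bestDistance)
            {
                bestDistance = distance;
                best = t.Key;
            }
        }
        return best;
    }
}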
Such an approach, if sufficiently accurate, would be relatively simple to implement, but as
demonstrated by the figures showing frequency analyses of drum sounds below, a single analysis of
each drum hit would likely not provide sufficient spectral information from which to classify each hit.
Figure 2.4: Waveform view of bass, snare, closed hi hat and open hi hat drum hits
Figure 2.5: Spectral view of the same drum hits as above, with time on the x-axis and frequency (from 10Hz to
20,000Hz) on the y-axis. The darker the area, the stronger that frequency is at that point.
Figure 2.6: 32-band FFT analysis of the first 1024 samples of each drum hit, in the same order as in Figure 2.4
(averaged over ten samples).
It can be seen from this sample of four different drums that the frequency content (particularly that of the first 1024 samples of audio) does not appear to vary sufficiently from drum to drum to allow accurate classification (for example, the closed and open hi hat appear almost identical). This is to be expected, as drum sounds caused by striking a metal drum, such as a hi hat, are essentially wideband noise (which is how they are often modelled in simple analogue drum synthesizers), and so the frequency analysis will show no obvious frequency peaks. It is for this reason that an approach such as that taken by [31], which takes into account the change of the frequency content over time throughout each drum hit, is likely to be much more successful – however, such an approach would require more time for processing each drum hit, which may not be ideal in a system designed exclusively for real time operation.
Another issue when considering the real time nature of the software is that of acceptable latency – a
combined analysis of the frequency content of the hit and the shape of its envelope, for example,
would allow easy distinction between the closed and open hi hat sounds due to their different
envelope shapes, but the software would need to sample the input for long enough to capture the
entire hit before classifying the drum sound.
For the software to be made more useful, it would ideally be able to detect two or more drums struck
at the same time, which presents a more complex problem. Some combinations of drums are unlikely,
or even impossible, on a standard drum kit – for example, the closed hi hat and open hi hat are played
on the same drum and so cannot occur at the same time, allowing certain combinations to be ignored.
Even so, a comparison of the FFTs of, for example, a snare drum alone and a bass drum and a snare at the same time shows very little difference between the two – it is a complex task to enable the software to detect when two sounds are being played, and when it is just one sound.
One method for recognising such combinations is inspired by [31], using a fitting-subtraction model
such that the best fitting template FFT (or in the case of [31], Bounded Q Transform vector, a similar
algorithm which provides better lower frequency resolution than the FFT [9]) for the captured onset is
selected, and subtracted from the captured vector. If the resulting vector is determined to still contain
pertinent information (i.e. spectral information which suggests another drum hit took place
simultaneously), the process is repeated to find the best fit for the subtracted vector until it appears
that all the hits occurring have been captured. Such a solution sounds like a sensible approach, but there are few details of the specifics of the implementation in the paper, and the problem remains that the FFT spectra of some drum sounds alone and mixed with another are very similar, so the detection of more than one drum simultaneously may not be possible in the prototype software.
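A minimal sketch of this fitting-subtraction loop might look as follows in C#; it reuses the hypothetical TemplateMatcher from earlier and a simple residual energy test to decide whether further spectral information remains. This illustrates the idea taken from [31], not the cited paper's actual implementation.

using System;
using System.Collections.Generic;

static class FittingSubtraction
{
    // Repeatedly fit the best template, subtract it from the spectrum, and
    // continue while the residual still contains significant energy.
    public static List<string> DetectHits(float[] spectrum,
                                          Dictionary<string, float[]> templates,
                                          double residualEnergyThreshold)
    {
        var hits = new List<string>();
        float[] residual = (float[])spectrum.Clone();

        while (hits.Count < templates.Count)
        {
            string best = TemplateMatcher.Classify(residual, templates);
            hits.Add(best);

            // Subtract the fitted template, clamping each bin at zero
            float[] template = templates[best];
            double energy = 0;
            for (int i = 0; i < residual.Length; i++)
            {
                residual[i] = Math.Max(0, residual[i] - template[i]);
                energy += residual[i] * residual[i];
            }

            // Stop when the residual no longer looks like another drum hit
            if (energy < residualEnergyThreshold)
                break;
        }
        return hits;
    }
}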
2.7 E-learning systems
Whilst the system as presented here does not represent a true e-learning system – it is designed to complement traditional tuition methods and increase their effectiveness, rather than teach a user to play the drums from scratch – some knowledge of good e-learning practice is useful. According to [3], a
successful e-learning system should:
• “be interactive and provide feedback”
• “have specific goals”
• “motivate, communicating a continuous sense of challenge”
• “provide suitable tools”
• “avoid any factor of nuisance interrupting the learning stream”
The proposed system meets the first two criteria – it is interactive, in that the user's playing is analysed in real-time by the software and it responds appropriately, and provides feedback based upon the accuracy of their playing. Specific goals can be set for a lesson in tutor mode (see 3.3), such that a student can only advance beyond a certain lesson when they reach a certain accuracy level. The other three criteria are not really applicable, as the prototype is not designed to offer tuition to the user.
3 Design and Implementation
3.1 Methodology
There exist a number of design and programming methodologies suitable for the project. Amongst the candidate methodologies are: the Spiral model, which combines elements of both designing the software in advance and allowing changes during development; the Waterfall model, which is carried out in a series of stages such that the system requirements are first designed, followed by implementation, evaluation and “maintenance” of the software; evolutionary prototyping, in which a prototype is built and modified as seen to be appropriate during development, with a focus on producing a high quality prototype limited in scope, rather than a rapidly developed system with a wide scope but rushed implementation; and Extreme Programming, which favours simple design, with rapid iterations and continuous testing of the solution, in addition to techniques such as pair programming [18].
The methodology followed during the development of the software is closest to that of the
evolutionary development model, with some values of extreme programming. Important concepts in
the design and development methodology included:
• starting with the simplest solution and subsequently developing additional functionality (as in Extreme Programming);
• a focus on implementing features such that they work reasonably well, rather than attempting to fit in as many features as possible (as in the evolutionary prototyping model);
• making an effort to avoid spending too much time on the design, which can result in the implementation suffering in a small, solo project such as this (a common feature of both approaches);
• and a focus on getting actual code written, rather than planning everything in detail first.
This said, the system design was still considered before starting any programming, and some
modelling techniques such as class diagrams have been used for illustration.
3.2 Software Architecture
After considering possible solutions to the problem, and taking into account the background research, a rough architecture for the software was designed, as illustrated by the class diagram in Figure 3.1. During the development process, and in line with the chosen development methodology, changes were made to the design to allow for easier coding or better performance; these changes are shown in the diagram by stating the class in the actual program which provided the functionality specified.
Figure 3.1: Class diagram representing the core components in the software architecture.
Names in parentheses indicate the class in the final software providing the functionality.
The intended functionality of each class is as follows (a skeleton sketch of this structure is given after the list):
• Lesson loader: read lesson and pattern files from disk, stored in an XML format, and pass the drum hit events contained within each pattern to the timing analysis class, in order to display the pattern in the user interface (the actual solution sent the events directly from this class to the UI class, and the lesson functionality was not implemented due to time constraints).
• Audio input: capture the microphone's signal via the computer's soundcard, and pass the captured data to the onset detection class (in fact, the audio input functionality was implemented as a part of the onset detection class).
• Onset detection: process the audio input to detect onsets, representing drum hits, and determine which drum was played, then pass onset events to the timing analysis class.
• Timing analysis: maintain a timer such that all beat events can be “timestamped”, and also provide a metronome to play along to.
• User interface: display a clear representation of the currently loaded pattern along with the user's input, and allow the software to be controlled (in the final implementation, the user interface class also analyses the timing differences between the input hits and the pattern).
• Audio output: output audio signals such as playing back the drum pattern or metronome.
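As a rough illustration of how these responsibilities might map onto C# classes, the following skeleton shows one possible arrangement. The class and method names here are hypothetical, chosen to mirror the list above rather than the actual class names of the final software (which are given in Figure 3.1).

using System;

// Hypothetical skeleton mirroring the intended architecture
class LessonLoader
{
    public void LoadPatternFile(string path) { /* parse the XML pattern file */ }
}

class OnsetDetector // also contains the audio input code in the final software
{
    public event Action<DateTime, string> OnsetDetected; // timestamp and drum name
    public void StartCapture() { /* begin reading from the capture buffer */ }
}

class TimingAnalyser
{
    public void TimestampEvent(DateTime time) { /* record against the shared timer */ }
    public void StartMetronome(int tempoBpm) { /* tick at the given tempo */ }
}

class AudioOutput
{
    public void PlayPattern() { /* play back the stored drum pattern */ }
}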
3.3 User Interface
As this project is intended to demonstrate a prototype piece of software based upon the research,
rather than investigate new areas in onset detection and drum classification, the design of the user
interface represents quite an important part of the solution. There are three distinct modes of
operation, selected from a dialog box upon loading the software.
Tutor Mode and Record Mode
Tutor mode was intended to provide an interface for a tutor to construct a lesson plan (for example,
for practice from week to week). Each lesson plan consists of a number of separate drum patterns,
arranged in a logical order (for example, gradually getting more complex, or faster), and each
individual pattern in the lesson has a number of parameters which can be set – the percent accuracy
score that the student needs to achieve in order to automatically move on to the next pattern, the
default tempo of the pattern, and (optionally) a maximum number of tries at the pattern (with each try
being a one bar loop) before the previous pattern is returned to.
Figure 3.2: A screenshot of the lesson planner window
New patterns can be added to a lesson either from a pre-existing pattern file (see Appendix C), or by
recording a new pattern. When recording a new pattern, the record mode user interface (based on the
student mode interface) is used to represent the played drum pattern, and the process is as follows:
1. Decide upon the initial tempo for the pattern by adjusting it using the tempo adjustment buttons in conjunction with the metronome.
2. Play the pattern on the drums once through (recording starts automatically on the first detected beat).
3. In the event that it is played incorrectly, keep playing it through until it is played correctly.
4. The captured pattern is automatically quantized, using the algorithm in Code example 1, and the original recorded pattern is shown alongside the quantized pattern, as shown in Figure 3.3.
5. Select “Play” to listen back to the recorded pattern, and “Save pattern” if it has been captured correctly, or “Record again” to rerecord it.
quantizeDivisor = 4 // Quantize to 1/16th notes
quantizeInterval = beatTime / quantizeDivisor
For each recorded beat r in recordedBeats:
    nearestIntervalPre = RoundDown((time of r in ms) / quantizeInterval) × quantizeInterval
    nearestIntervalPost = nearestIntervalPre + quantizeInterval
    If the time of r in ms is closer to nearestIntervalPre:
        Set the time of r in ms to nearestIntervalPre
    Else:
        Set the time of r in ms to nearestIntervalPost
Code example 1: Simplified pseudo-code of the recorded beat quantization algorithm, implemented
as QuantizedRecordedBeat() in the UI class, designed to remove small timing variations in the target
pattern by “snapping” events to regular intervals (in this case, 1/16th notes)
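For reference, a direct C# rendering of this quantization step might look as follows; this is a sketch based on the pseudo-code above, not the actual QuantizedRecordedBeat() implementation, and the names used are illustrative.

using System;
using System.Collections.Generic;

static class Quantizer
{
    // Snap each recorded beat time (in ms) to the nearest 1/16th-note interval.
    // beatTimeMs is the duration of one beat (a quarter note) in milliseconds.
    public static void QuantizeBeats(List<double> beatTimesMs, double beatTimeMs)
    {
        const int quantizeDivisor = 4;                  // 1/16th notes
        double interval = beatTimeMs / quantizeDivisor;

        for (int i = 0; i < beatTimesMs.Count; i++)
        {
            double pre = Math.Floor(beatTimesMs[i] / interval) * interval;
            double post = pre + interval;

            // Snap to whichever interval boundary is closer
            beatTimesMs[i] = (beatTimesMs[i] - pre < post - beatTimesMs[i]) ? pre : post;
        }
    }
}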
The final arrangement of patterns, as well as the patterns themselves, is saved to an XML file which could then easily be distributed to students. The tutor mode user interface is implemented using standard Windows Forms controls, in addition to the use of a modified student mode UI for recording.
Unfortunately, due to time constraints and the fact that it was not a part of the minimum requirements,
the lesson mode functionality could not be implemented beyond designing the user interface, although
the functionality to record, quantize and store a played pattern is present.
Figure 3.3: The record mode UI. The captured pattern can be seen with black boxes representing
the captured hits, and blue boxes representing those hits after quantization.
Student Mode
Student mode is designed to allow the user to play through a previously stored lesson and receive feedback upon their playing. The currently stored pattern is represented graphically, as shown in Figure 3.4, with each row representing a separate drum part and the horizontal direction representing the time of the hit (in this prototype, limited to one bar). As the user plays, their detected drum hits are displayed below the appropriate row of the stored pattern, allowing it to be seen how their playing matches up to the expected playing. The boxes representing a user's hits are coloured based upon how close to the expected drum hit their playing was – if a hit is perfectly on time (i.e. it appears directly below one of the expected pattern hits) the box is coloured green, moving towards red as the accuracy of the hit decreases. Erroneous hits for which no candidate pattern hit can be determined (for example, if the wrong drum is struck) are coloured red. A sketch of this colour mapping is given below.
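The green-to-red grading could be implemented as a simple linear interpolation on the timing error; the following C# sketch (using hypothetical names, not the prototype's actual code) illustrates the idea.

using System;
using System.Drawing;

static class HitColour
{
    // Map a timing error (ms) to a colour between green (on time)
    // and red (at or beyond the maximum accepted error).
    public static Color ForTimingError(double errorMs, double maxErrorMs)
    {
        double t = Math.Min(Math.Abs(errorMs) / maxErrorMs, 1.0);
        int red = (int)(255 * t);
        int green = (int)(255 * (1.0 - t));
        return Color.FromArgb(red, green, 0);
    }
}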
Figure 3.4: The student mode user interface, displaying a simple two drum pattern and a user input, played slightly late. The larger numbers along the top represent the four beats in a bar, and the smaller numbers represent the 1/16th notes in a beat.
In addition to the main graphical representation of the user's performance, statistics on the accuracy of the user's playing are displayed, showing the average accuracy of the previous bar the user played, the overall average accuracy for the current lesson so far, and the target pass accuracy (due to an unsolved problem whereby changing the caption of a label in the interface at runtime caused the software to consume as much memory as it could before crashing, these statistics are output to the console in the final version). If the average accuracy for a certain drum pattern is low, the tempo is automatically lowered so that the user may gain familiarity with the pattern at a lower tempo, before the software offers to increase the tempo again. Details of the implementation of the timing analysis are in section 3.7.
Additional controls in the user interface allow the user to alter the current tempo, to restart the current
pattern and wait for them to play another beat before continuing, to play the current pattern through
the audio output (see section 3.8) and to calibrate the software (for information regarding calibration,
see 3.6). Were the implementation of the lesson system to have been completed, buttons would have
been added to move forward and backward through patterns in the current lesson.
The graphical elements of the student mode interface are created using Microsoft Direct3D (see 2.2.2), which allows for pixel accurate placement of graphics and easy control of visual effects such as transparency, which is used to fade out the boxes representing older hits to avoid clutter. The three main elements of the UI – the text labels, the buttons and the boxes (used to represent the drum hits) – are implemented as separate classes (D3DLabel, D3DButton and Box/SpecialBox, representing unfilled and filled boxes). This was done because only basic functionality was required – for example, the D3DButton has a method ContainsPoint() which returns whether or not the position of the mouse when clicked is inside the boundaries of the button – and the Direct3D SDK examples of such UI controls appeared somewhat overcomplicated.
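As an illustration, a ContainsPoint() check of this kind amounts to a simple rectangle test; the sketch below assumes hypothetical position and size fields, as the actual D3DButton class is not reproduced here.

using System.Drawing;

class D3DButtonSketch
{
    public int X, Y, Width, Height; // screen-space bounds of the button

    // True if the clicked point lies within the button's boundaries
    public bool ContainsPoint(Point p)
    {
        return p.X >= X && p.X <= X + Width &&
               p.Y >= Y && p.Y <= Y + Height;
    }
}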
3.4 Audio input
For the purposes of this project, it was decided to use a sampling rate of 44,100Hz, an 8-bit sample depth and a mono (single channel) input format. A rate of 44,100Hz was chosen because a sampling rate of 22,050Hz (the most common rate below 44.1kHz) cannot, according to the Nyquist theorem [2], correctly represent frequencies above around 11kHz; whilst such high frequencies may or may not be important to the performance and accuracy of the software, it seems prudent to retain them. A sampling depth of 8 bits was chosen because the additional accuracy of a 16-bit depth, whilst significantly improving the quality of audio when played back, does not significantly affect the representation of simpler sounds such as drums, and so in the interests of simplicity and performance, half the amount of data can be processed with little trade-off. Microphones are virtually always mono (with the exception of specialist binaural microphones), and so there is little sense in capturing a mono signal in stereo. This chosen format has the added benefit of being a common input format, and is therefore supported by the majority of soundcards.
The audio input component of the software is implemented in the same class as the beat detection component (contrary to the classes originally proposed in the modelling in 3.2), as the interclass communication would be so frequent that it was not deemed necessary to separate them. The code for capturing the audio from the sound card is based upon the “CaptureSound” example provided with the DirectX SDK [21], and makes use of the DirectSound API.
An audio capture buffer is created for the primary audio input (as selected in Control Panel) with the
chosen sample rate, depth and number of channels, and when the “Start” method of the capture buffer
is called, it is constantly filled with input data from the default sound input in a circular fashion (i.e.
new data replaces the oldest data once the buffer is full). To process the audio at regular intervals,
notification events are assigned to positions in the capture buffer – whenever a given position in the
buffer is reached, a NotificationEvent is triggered, which in turn calls the method to process the
newest chunk of audio. The capture buffer size remains fixed at one second of audio, and so the
frequency of the notification events determines the latency of the input – the more frequent the
notification events, the more regularly captured audio is processed, as illustrated in Figure 3.5.
Figure 3.5: An illustration of the circular capture buffer and notification positions used for capturing the input. In the illustrated example there are only four notification positions across a one-second (44,100 sample) buffer, so a notification event is fired every ¼ second; on writing to the end of the buffer, the write position pointer wraps back to the start, overwriting the oldest data.
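Based on the description above and the “CaptureSound” SDK example the code follows, the Managed DirectX capture setup would look roughly like the sketch below (error handling omitted); the details are recalled from the Managed DirectX documentation and should be treated as illustrative rather than a copy of the prototype's code.

using System.Threading;
using Microsoft.DirectX.DirectSound;

class AudioCaptureSketch
{
    const int SampleRate = 44100;   // one second of 8-bit mono audio = 44,100 bytes
    const int NotifyCount = 4;      // four notification positions => 1/4s intervals

    public void Start()
    {
        WaveFormat format = new WaveFormat();
        format.FormatTag = WaveFormatTag.Pcm;
        format.SamplesPerSecond = SampleRate;
        format.BitsPerSample = 8;
        format.Channels = 1;
        format.BlockAlign = 1;                    // 8-bit mono => 1 byte per sample
        format.AverageBytesPerSecond = SampleRate;

        CaptureBufferDescription description = new CaptureBufferDescription();
        description.Format = format;
        description.BufferBytes = SampleRate;     // fixed one-second circular buffer

        Capture device = new Capture();           // default capture device
        CaptureBuffer buffer = new CaptureBuffer(description, device);

        // Fire an event each time the write position crosses a notification point
        AutoResetEvent notifyEvent = new AutoResetEvent(false);
        BufferPositionNotify[] positions = new BufferPositionNotify[NotifyCount];
        for (int i = 0; i < NotifyCount; i++)
        {
            positions[i].Offset = (i + 1) * (SampleRate / NotifyCount) - 1;
            positions[i].EventNotifyHandle = notifyEvent.Handle;
        }
        new Notify(buffer).SetNotificationPositions(positions);

        buffer.Start(true);                       // capture continuously (looping)

        // A worker thread would wait on notifyEvent and call ProcessAudioFrame()
        // (see section 3.5) on each newly captured chunk.
    }
}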
3.5 Onset detection
As described in section 2.5, there exist a number of approaches to onset detection, many of which are
designed to tackle the complex problem of extracting beats from a piece of music containing other
elements in addition to the drums. It was not deemed necessary to investigate the performance of the
more complex algorithms for the project unless a simpler (and therefore better suited to real-time performance) algorithm proved to be insufficient.
The onset detection method, ProcessAudioFrame, is called whenever a notification event in the audio capture buffer is reached (see above). It was decided to implement an energy based onset detection algorithm, as described in section 2.5 (and as mentioned in [5]). The implementation was inspired by informal discussion in [26]: the local short-time energy of a given number of input samples is calculated by implementing Equation 2.1, and a fixed length history buffer of previous energy values is maintained, allowing the average energy value over a chosen amount of time in the past to be calculated from the mean of all of the values in the history buffer; this average is then compared to the new short-time energy value.
Whilst this basic implementation of the algorithm successfully detected onsets, it was also observed to
detect a number of erroneous onset events. The two main problems apparent in the implementation
were that a single beat was sometimes being detected as multiple onsets (which is logical, considering
that the energy value during the entire time of the drum hit is likely to be higher than the average
energy), and that on introducing some artificial background noise (by mixing the drum track with
white noise of varying amplitude), onsets were detected at small increases in local energy which were
actually a change in the background noise. In order to counter the second issue, the average past
energy is multiplied by a threshold value (as indicated in Figure 3.6 and Figure 3.7), and to avoid the
spurious onsets during a hit being detected, the algorithm was modified such that when the threshold
value is crossed, no more onsets are detected until the local energy goes below the threshold value
again, as can be seen in the pseudo-code version of the algorithm in Code example 2.
[Plots of short-time energy and average energy against time, with detected onsets marked where the short-time energy crosses the average energy × threshold (1.5) line.]
Figure 3.6: Plot of energy levels against time with a clean input, using an energy sample size of 1024 samples and an energy history buffer of 1 second
Figure 3.7: Plot of energy levels with background noise added, using the same parameters as the previous plot.
With these two modifications, the onset detection appeared to work perfectly for simulated drum
inputs with even quite high noise levels in the background (generated using Adobe Audition and
royalty-free drum kit samples provided with Computer Music magazine, and recorded via the “Stereo
Mixer” input of the soundcard, which feeds the audio output back in as an input), and surprisingly,
even performed quite impressively with recorded pieces of music with a strong beat. No further
improvements to the onset detection algorithm were deemed necessary.
• localEnergy, historyEnergy = 0  // Initialize the variables
• For each sample value s in the unread section of the buffer:
  o localEnergy = localEnergy + (s × s)  // Add s² to the running local energy total
• For each energy value e in localEnergyHistoryQ:
  o historyEnergy = historyEnergy + e  // Sum all of the previous energy values
• historyEnergy = historyEnergy ÷ (size of localEnergyHistoryQ)  // Calculate the mean
• If localEnergy > historyEnergy × thresholdValue:  // Possible onset detected
  o If (currentlyInBeat = false):  // If not currently in a beat…
    - Onset detected  // … a new onset is detected
    - currentlyInBeat = true  // Suppress further onsets until the energy falls again
• Else if (currentlyInBeat = true):  // If in a beat and no possible onset detected…
  o currentlyInBeat = false  // … then no longer in that beat
• Remove element 1 from localEnergyHistoryQ  // Remove the oldest energy value
• Add localEnergy to the end of localEnergyHistoryQ  // And add the newest value
Code example 2: Simplified pseudo-code of the onset detection algorithm
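To make this concrete, a minimal C# sketch of the detector follows. The class and member names here are illustrative rather than taken from the project code, and the samples array is assumed to hold the unread 16-bit samples from the capture buffer:

using System.Collections.Generic;

class OnsetDetector
{
    private readonly Queue<double> energyHistory = new Queue<double>();
    private readonly int historyLength;       // number of energy values kept
    private readonly double thresholdValue;   // e.g. 1.5
    private bool currentlyInBeat;

    public OnsetDetector(int historyLength, double thresholdValue)
    {
        this.historyLength = historyLength;
        this.thresholdValue = thresholdValue;
    }

    // Returns true if a new onset begins in this frame of samples.
    public bool ProcessFrame(short[] samples)
    {
        // Local short-time energy: the sum of squared sample values (Equation 2.1).
        double localEnergy = 0;
        foreach (short s in samples)
            localEnergy += (double)s * s;

        // Average past energy: the mean of the stored history values.
        double historyEnergy = 0;
        foreach (double e in energyHistory)
            historyEnergy += e;
        if (energyHistory.Count > 0)
            historyEnergy /= energyHistory.Count;

        bool onset = false;
        if (energyHistory.Count > 0 && localEnergy > historyEnergy * thresholdValue)
        {
            if (!currentlyInBeat)
            {
                onset = true;            // threshold crossed from below: a new onset
                currentlyInBeat = true;  // suppress further onsets during this hit
            }
        }
        else
        {
            currentlyInBeat = false;     // energy back below the threshold: hit over
        }

        // Slide the fixed-length history window forward.
        energyHistory.Enqueue(localEnergy);
        if (energyHistory.Count > historyLength)
            energyHistory.Dequeue();

        return onset;
    }
}

With 1024-sample frames at 44.1 kHz, a one-second energy history corresponds to a historyLength of around 43 frames.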
3.6 Drum classification
Whilst a simple version of the software could be useable without attempting to classify each drum hit – instead just detecting each onset, analysing its timing and assuming that the correct drum had been played – the feedback available would not be especially useful, particularly in terms of the graphical representation of the user's playing. It was therefore necessary to consider some of the approaches to drum classification discussed in section 2.6.
It was clear that frequency analysis of each drum hit was likely to provide the most useful feature set from which to classify the sounds, and so it was necessary to implement a Fast Fourier Transform. Rather than implement one from scratch, and likely end up with quite an inefficient algorithm, the open source Exocortex.DSP library [17] was tested. This library, written in C#, provides the necessary classes and methods to perform FFTs and is “optimized for both speed and numerical accuracy” [17]; it was found to be easy to integrate and efficient enough to use in real time.
Initially, the FFT result of each frame of audio was written to a text file, and Microsoft Excel was used to graph the data. To test that the FFT was working as expected, sine waves at various frequencies were generated using Adobe Audition (a pure sine wave at a given frequency contains no harmonics, and so should generate a single peak in the FFT output, in the appropriate frequency bin). As shown in Figure 3.8, the results were as expected – there is a degree of “noise” in the results due to the noise and harmonic distortion introduced by capturing the output of the soundcard and analysing that, rather than reading the sound directly from a file on disk.
Figure 3.8: FFT output (downsampled from 1024 to 32 bins) from sine waves at
100Hz, 1000Hz, 5000Hz and 10000Hz (peaks from left to right)
In order to capture the FFT of each detected drum hit, the FFTHit() method is called by the onset detection algorithm (see Code example 2) whenever a new onset is detected, with the current chunk of captured audio as its input. The Exocortex.DSP library provides a number of different FFT methods for one-, two- and three-dimensional data sets. In the case of audio, the data is one-dimensional and in the form of real numbers (rather than complex numbers), so the RFFT() method is called, which performs a one-dimensional, real, symmetric, forward-direction FFT on the sample data. The library performs no downsampling on the FFT, and so the returned array containing the results must be at least the same size as the input array, resulting in a large number of frequency bins.
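The call itself is then very small. The Fourier.RFFT entry point below is how the library's documentation names the method, but the exact signature shown here is an assumption:

using Exocortex.DSP;

short[] samples = GetCurrentFrame();   // the captured frame (illustrative helper)
float[] fftData = new float[samples.Length];
for (int i = 0; i < samples.Length; i++)
    fftData[i] = samples[i];

// One-dimensional, real, forward FFT, performed in place on the array.
Fourier.RFFT(fftData, FourierDirection.Forward);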
In order to aid processing and analysis of the output, it is downsampled to a smaller number of frequency bins (the effect of varying the number of frequency bins is discussed in section 4.3) by averaging the absolute values of a number of adjacent frequency bins into one output bin (any negative results are made positive, as the sign of a frequency component is related to its phase, which is not considered here). For example, if an array of 1024 results is to be downsampled to 32 frequency bins, 1024 ÷ 32 = 32, so each output bin is the average of 32 adjacent input bins. The resulting FFT is normalized (scaled so that the peak value is 1) to remove the effects of any variations in volume.
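A sketch of these two steps follows (the method name is illustrative):

using System;

static float[] DownsampleAndNormalize(float[] fft, int outputBins)
{
    int groupSize = fft.Length / outputBins;   // e.g. 1024 ÷ 32 = 32 input bins per output bin
    float[] result = new float[outputBins];

    // Each output bin is the average of the absolute values of a group of
    // adjacent input bins (the sign relates to phase, which is discarded).
    for (int i = 0; i < outputBins; i++)
    {
        float sum = 0;
        for (int j = 0; j < groupSize; j++)
            sum += Math.Abs(fft[i * groupSize + j]);
        result[i] = sum / groupSize;
    }

    // Scale so that the peak value is 1, removing the effect of volume changes.
    float peak = 0;
    foreach (float value in result)
        if (value > peak) peak = value;
    if (peak > 0)
        for (int i = 0; i < outputBins; i++)
            result[i] /= peak;

    return result;
}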
The resulting array represents the frequency spectrum of the frame of audio in which the onset was
detected – ideally, it would represent the frequency spectrum of a given number of samples
immediately following the onset itself, but given sufficiently short audio frame sizes, this simpler
implementation should be sufficient.
In order to classify the drum sounds, a template FFT vector for each drum must first be generated. This is achieved using the “Calibrate” option of the software. When calibrating, the user is prompted to strike each drum five times. Each of these five hits is stored, and its downsampled, normalized FFT calculated. A template FFT vector can then be determined for each drum by averaging each bin across the five sample hits. On completing calibration, these templates are written to a file on disk as space-separated arrays of floating point numbers representing the value of each frequency component (and are loaded on subsequent runs).
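A sketch of the template generation and storage follows; the method names and file layout are illustrative:

using System.Collections.Generic;
using System.IO;

// Average the five downsampled, normalized FFTs bin-by-bin into one template.
static float[] BuildTemplate(List<float[]> hitFfts)
{
    float[] template = new float[hitFfts[0].Length];
    foreach (float[] fft in hitFfts)
        for (int i = 0; i < template.Length; i++)
            template[i] += fft[i] / hitFfts.Count;
    return template;
}

// One line per drum: space-separated floating point bin values.
static void SaveTemplates(string path, IEnumerable<float[]> templates)
{
    using (StreamWriter writer = new StreamWriter(path))
        foreach (float[] template in templates)
            writer.WriteLine(string.Join(" ", template));
}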
Once the template FFT’s for each drum are stored, the software can attempt to classify each drum hit
(i.e. each time a new onset is detected by the onset detection algorithm) by performing an FFT on the
captured audio frame containing the onset, downsampling and normalizing this, and selecting the
closest matching template FFT. Several methods for selecting the closest match were considered. The
simplest method is to iterate through every template and for each frequency bin, calculate the
difference between that bin in the template and the same bin in the captured frame. The sum of these
differences is then calculated for each template, and the template with the smallest (absolute)
difference value from the captured sound is selected as the most likely sound. On testing this
algorithm, it performed reasonably – it was able to correctly classify the bass and the snare sounds as
either one or the other, and the closed and open hi hats as one or the other, but had difficulty
differentiating between these two pairs of sounds (which makes sense, considering that the spectrums
of each pair are quite similar to each other).
This simple method could be improved by using advance knowledge of which drum is expected. For example, if the differences between the templates and the drum sound are such that a bass drum is the first choice and a snare drum the second, but the closest expected drum to the time at which it was struck is a snare drum, the snare drum possibility could be given additional weighting and would therefore be selected. Of course, this would mean that if the user hit the bass drum in error in this situation, it would still likely be classified as a snare.
The main problem with this algorithm is that it does not take the “shape” of the FFT into account – it is entirely possible for entirely the wrong template to be selected. For example, as shown in Figure 3.9, if FFT 1 were the input and FFT 2 a candidate template, the difference between the corresponding bins of the two would be (1 – 0) + (0.75 – 0.25) + (0.5 – 0.5) + (0.25 – 0.75) + (0 – 1) = 0, resulting in FFT 2 being selected as a perfect match – despite clearly being very different.
[Bar charts: FFT 1 has bin values 1, 0.75, 0.5, 0.25 and 0 across bins 1–5; FFT 2 has bin values 0, 0.25, 0.5, 0.75 and 1.]
Figure 3.9: A simple example of two different FFTs, for which the difference between the two using the above algorithm is zero, resulting in a false match being found.
A basic improvement on this algorithm is to consider the differences between adjacent frequency bins in the candidate template FFT and the captured FFT, instead of the difference between each corresponding bin. In the above example, the difference between each pair of adjacent bins of the first FFT is −0.25, and of the second is +0.25. On comparing the differences between the two FFTs, the final difference value would be: (−0.25 − +0.25) + (−0.25 − +0.25) + (−0.25 − +0.25) + (−0.25 − +0.25) = −2.
To illustrate the difference between the two algorithms, consider the third example FFT in Figure 3.10, with FFT 1 as the input and FFT 2 and FFT 3 as the candidate templates. Using the first algorithm (the differences between corresponding frequency bins), the difference between the input and FFT 3 is (1 − 1) + (0.75 − 0.7) + (0.5 − 0.55) + (0.25 − 0.3) + (0 − 0.1) = −0.15, while the difference between the input and FFT 2 is 0 (as illustrated above), so the clearly wrong match would be selected. Using the improved algorithm, the difference between the input and FFT 3 would be (−0.25 − −0.3) + (−0.25 − −0.15) + (−0.25 − −0.25) + (−0.25 − −0.2) = −0.1, and the difference between the input and FFT 2 would be −2, as stated above, resulting in the correct template being selected (the sign of the difference is discarded, so 0.1 < 2, and therefore FFT 3 is correctly chosen).
This algorithm would appear to provide a simple, but hopefully effective method of matching the
input FFT to the best fitting template, and the results are discussed in section 4.3.
[Bar chart: FFT 3 has bin values 1, 0.7, 0.55, 0.3 and 0.1 across bins 1–5.]
Figure 3.10: A third example FFT, similar (but not identical) to the first FFT in Figure 3.9.
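To make the two metrics concrete, a C# sketch of both follows; the method names are illustrative, and the scores are compared by absolute value, smallest winning:

using System;

// First algorithm: sum of corresponding-bin differences.
static float BinDifference(float[] input, float[] template)
{
    float sum = 0;
    for (int i = 0; i < input.Length; i++)
        sum += template[i] - input[i];
    return Math.Abs(sum);
}

// Improved algorithm: compare the adjacent-bin differences (the "shape").
static float ShapeDifference(float[] input, float[] template)
{
    float sum = 0;
    for (int i = 0; i < input.Length - 1; i++)
        sum += (input[i + 1] - input[i]) - (template[i + 1] - template[i]);
    return Math.Abs(sum);
}

// The captured hit is assigned to the template with the smallest score.
static int Classify(float[] input, float[][] templates)
{
    int best = 0;
    for (int t = 1; t < templates.Length; t++)
        if (ShapeDifference(input, templates[t]) < ShapeDifference(input, templates[best]))
            best = t;
    return best;
}

For the worked example above, ShapeDifference returns 0.1 for FFT 3 and 2 for FFT 2, so FFT 3 is correctly selected.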
An effort was made to implement the subtraction method of classifying multiple drum beats played at the same time, as mentioned in section 2.6 (inspired by the method in [31]), but the results were very poor. It would seem that the implementation used was overly simplistic, as subtracting a template FFT from the input FFT left insufficient frequency information to try to fit another template. The code was modified so that 0.5 × the selected template was subtracted instead, and the resulting FFT normalized again, but this still yielded no significant results; identifying combinations of beats has therefore been noted as an area for improvement, as it was not one of the original requirements.
3.7 Timing analysis
An important part of making the system a useful tool is to provide feedback based upon the user's playing, and optionally to allow the software to respond to it automatically, for example by changing the tempo of the beat if necessary – without this, little advantage would be offered over solo practice. In order to do so, it is first necessary to be able to keep track of timing to a reasonable degree of accuracy.
TimingAnalysis
- barTime : float
- barTimeElapsed : float
- beatTime : float
- beatTimeElapsed : float
- metronome : bool
- metronomeCount : int
- playMode : bool
- tickTimeElapsed : float
+ BeatEvent (drum : int)
+ ChangeTempo (tempo : int)
+ GetNextBeatTime ()
+ Initialize ()
+ Play (events : List <float, int>)
+ SwitchMetronomeState (state : bool)
Figure 3.11: UML class diagram of the timing analysis class
The timing functionality of the system is contained within the TimingAnalysis class (see Figure 3.11 for the class diagram), which utilises a timer to keep track of the time elapsed. Several timers are available in the Windows environment, with varying degrees of accuracy, jitter and ease of use. The software for this project utilises an enhanced version of the Microsoft DirectX SDK high-resolution timer class, DXTimer (which in turn queries the Windows high-resolution performance counter), used with permission from [30]; this provides the highest-resolution timing available [23].
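The project itself uses DXTimer, but the same performance counter is exposed in .NET through System.Diagnostics.Stopwatch; an equivalent elapsed-time sketch follows, in which the bar-time bookkeeping is illustrative:

using System.Diagnostics;

// Stopwatch wraps the high-resolution performance counter where available.
Stopwatch timer = Stopwatch.StartNew();
float barTime = 2000f;   // e.g. one 4/4 bar at 120 BPM, in milliseconds

// Later, e.g. when an onset is detected:
float elapsedMs = (float)timer.Elapsed.TotalMilliseconds;

// Time elapsed relative to the start of the current bar.
float barTimeElapsed = elapsedMs % barTime;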
The class sends “tick” events to the user interface to move the playback position marker along, as well as (optionally) sending “MetronomeTick” events to the AudioPlayer instance on every beat. In addition, the method “GetNextBeatTime” is called by the BeatDetection class when an onset is detected; this causes the TimingAnalysis instance to store the current elapsed time, relative to the start of the current bar. When the type of drum present at the onset has been determined (which depends upon how many frames of audio are being analysed for the FFT), the method “BeatEvent” is called, which in turn calls the RecieveBeatEvent method of the UserInterface instance. The time at which the beat was originally detected is passed back to the UI class, allowing it to be drawn in the correct location on the grid (albeit with some latency if more than one frame is being analysed).
The analysis of the user's playing is performed by the UI class. When beat events are received, they are drawn in the appropriate position based upon the time of the beat and the drum played, and coloured according to their proximity to the nearest beat defined in the pattern, determined by the algorithm in Code example 3.
• window = beatTime / 2
• candidates = List of type DrumHit
• For each DrumHit d in the stored pattern:
  o hitTime = time in the bar at which d occurs, in milliseconds
  o userHitTime = time in ms at which the onset was detected
  o If (hitTime – window < userHitTime) and (hitTime + window > userHitTime):
    - Add d to candidates
• For each DrumHit d in candidates:
  o If d is played on the same drum as userHit:
    - deviation = Absolute value of (userHitTime – d.hitTime)
• Select the d with the smallest deviation, and return (minimum deviation / (window × 2))  // Deviation normalized to between 0 and 1
Code example 3: Simplified pseudo-code version of the algorithm for selecting the best candidate drum hit (FindDeviation())
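A C# sketch of this selection follows. The DrumHit fields and the −1 “no match” return convention are illustrative, and the normalization matches the pseudo-code above:

using System;
using System.Collections.Generic;

struct DrumHit
{
    public int Drum;        // which drum the hit is on
    public float HitTime;   // time within the bar, in milliseconds
}

static float FindDeviation(List<DrumHit> pattern, int userDrum,
                           float userHitTime, float beatTime)
{
    float window = beatTime / 2;
    float bestDeviation = -1;   // -1 means no candidate found

    foreach (DrumHit d in pattern)
    {
        // Candidates are hits on the same drum within half a beat either side.
        if (d.Drum != userDrum) continue;
        if (userHitTime <= d.HitTime - window || userHitTime >= d.HitTime + window) continue;

        float deviation = Math.Abs(userHitTime - d.HitTime);
        if (bestDeviation < 0 || deviation < bestDeviation)
            bestDeviation = deviation;
    }

    // Deviation normalized as in the pseudo-code above.
    return bestDeviation < 0 ? -1 : bestDeviation / (window * 2);
}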
Additionally, the deviations of each recent drum hit are stored, and are analysed on adding a new hit
to find the average deviation – if the average deviation is above a threshold value, the tempo of the
pattern is lowered in order to allow the user to practice at a slower speed until they have improved
their accuracy, at which point they can increase the tempo again.
3.8 Audio Player
The AudioPlayer class is used to output audio events, using DirectSound secondary buffers. The implementation is very simple – as seen in the class diagram below, there are methods for playing a metronome tick, a single drum sound, or a combination of drum sounds (each combination is represented as a four-bit binary number, with one bit per drum – for example, if drums one and three are to play at the same time, the binary representation is 0101, which equals 5 in decimal). On creating an instance of the class, secondary buffers (used to play audio files from disk) are created for each sound, and on calling one of the methods described above, the “Play()” method of the appropriate secondary buffer(s) is called.
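A sketch of decoding such a combination value follows; PlayDrum stands in for the call to the appropriate secondary buffer's Play() method:

void PlayCombination(int combination)
{
    // e.g. combination = 5 (binary 0101) plays drums one and three.
    for (int drum = 0; drum < 4; drum++)
        if ((combination & (1 << drum)) != 0)
            PlayDrum(drum);
}

void PlayDrum(int drum)
{
    // Calls Play() on that drum's secondary buffer.
}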
Unfortunately, a small amount of jitter (unwanted variation in timing accuracy) was detected on comparing the output of the metronome to a computer-generated “click track” at the same tempo, particularly when CPU or disk usage was high. By replacing the secondary buffer calls with Console.Beep() calls (which cause the internal speaker of the computer to beep) and capturing the output with a microphone, it could be seen that the timing issues appeared to be caused by DirectSound, as the internal speaker output was much more accurate. There was not enough time remaining to determine the exact source of the problem, or to develop a workaround, and so in its current implementation the audio playback is less than ideal.
4 Evaluation
4.1 Evaluation methodology
The most significant parts of the software to evaluate were deemed to be the accuracy of the onset detection and the accuracy of the drum classification. In order to evaluate these, the software was modified to output each detected onset, along with the drum selected for it, to a tab-delimited text file. Input files were prepared using Adobe Audition and a library of drum samples, and the timing and drums used in these inputs were recorded. Comparison of the software output and the actual input was then performed manually.
The details of the test files are as follows:
• Timing test 1: four bass drum beats, evenly spaced at 120 BPM (0.5 seconds between each beat)
• Timing test 2: variable timing – 12 bass drum hits in a bar at 120 BPM, with the time between each one decreasing (not locked to a grid)
• Timing test 3: fast snare roll (1/16th notes) at 140 BPM (0.11 seconds between each beat)
• Classification tests: looped playback of one drum sample following calibration
4.2 Onset detection accuracy
In order to assess the accuracy of the onset detection algorithm, the output file of hit times (relative to the start of the bar) was loaded into Microsoft Excel, and the time of each hit was compared to the expected time of the hit. The mean and standard deviation of the differences were calculated, to indicate the average amount of error and the range of the errors occurring.
In order for the onset detection to be effective, it should be able to detect every beat played, without
missing any beats, to a good degree of accuracy (preferably around 1/100th of a second). Ideally, it
should perform well if the input signal is degraded slightly, for example by background noise, which
could mask certain beats or be detected as false onsets.
Two factors are likely to affect the accuracy of the onset detection. The size of the short-time energy samples determines the timing accuracy of the detection: provided that the audio is processed at least as often as the energy calculation is carried out, the number of samples that must be collected before comparing the current energy to the average energy determines the timing resolution to which an onset can be detected. The length of the energy history determines how long a period is used to calculate the past average energy level, affecting whether onsets are detected or “averaged out”. Too long an energy history may result in the average past energy taking into account too much of the preceding audio, making the calculated value too high and causing some onsets to be missed; conversely, too short an energy history may not take enough of the background noise into account, resulting in onsets being detected at points where none occurred.
When interpreting these results, it is important to bear in mind the way in which the onset detection works, particularly with regard to the timing resolution to which onsets can be detected – the granularity at which the audio is processed is restricted by the size of the energy samples taken, as illustrated below. In this illustration, the dotted lines represent the boundaries between each energy sample calculated, and so all three of the onsets illustrated would be detected at the same time, as they fall inside the same energy sample frame.
[Diagram: three onsets falling within a single energy sample frame, with dotted lines marking the boundaries between energy samples.]
Figure 4.1: An illustration of how energy sample size affects the timing resolution of onset detection.
Effects of variation of the energy sample size
The larger energy sample sizes (512 up to 2048) were tested with all three timing tests. The smaller sample sizes (below 512) introduced a large number of erroneous beats, and so timing test 1 alone was sufficient to confirm that their results were not satisfactory. Due to the manual analysis of results, some erroneous beats detected may have been included in the calculation of the means and standard deviations.
Table 1: Showing the effects of the energy sample size on timing accuracy

Test Run                            Mean difference (s)   Standard deviation   Beats correctly detected   Erroneous beats detected
Timing test 1, Sample size = 2048   0.023                 0.015                64                         0
Timing test 2, Sample size = 2048   0.041                 0.043                256                        0
Timing test 3, Sample size = 2048   0.024                 0.016                256                        0
Timing test 1, Sample size = 1024   0.011                 0.050                64                         0
Timing test 2, Sample size = 1024   0.016                 0.031                256                        0
Timing test 3, Sample size = 1024   0.010                 0.009                256                        0
Timing test 1, Sample size = 512    0.006                 0.010                64                         3
Timing test 2, Sample size = 512    0.010                 0.006                256                        23
Timing test 3, Sample size = 512    0.005                 0.008                256                        12
Timing test 1, Sample size = 256    0.005                 0.012                64                         29
Timing test 1, Sample size = 128    0.004                 0.009                64                         68
As can be seen from the results, a sample size of 1024 samples produced the best results, with no erroneous beats being detected for any of the test inputs, and an average deviation from the actual timing of 0.012 seconds. The timing accuracy with a size of 2048 samples is about half that of 1024 samples, as would be expected (at the 44.1 kHz sample rate, 1024 samples span roughly 23 ms and 2048 samples roughly 46 ms), and the smaller sample sizes are too prone to detecting erroneous beats, for example during a single drum sound.
Variation of energy history size
As discussed above, varying the energy history size should determine how the background noise level
affects the accuracy of the detection of onsets.
To first establish whether the energy history size had an effect on the accuracy of the lower sample sizes (which offer better timing accuracy, but detect a number of erroneous beats), the energy sample size was fixed at 256 samples, and the energy history size halved and doubled. However, no significant effect was observed, and so a fixed sample size of 1024 (at which the optimum results were found) was decided upon.
Timing test 1 was used for each test, as the interest here was in how many beats were detected correctly and how many erroneous beats were also detected, rather than in timing accuracy, so a simple drum pattern was sufficient. Background noise was mixed with the test input to produce a quite exaggerated simulation of the background noise that might be encountered when using the system with a real setup; it was mixed at a level such that approximately 32 false positives were detected per 64 beats, using a sample size of 1024 samples and a history length of one second. More details on the addition of background noise are given in Appendix D.
The results of using different energy history lengths are presented in the table below, in terms of how
many beats out of 64 were successfully detected, and how many (if any) extra beats were detected.
Table 2: Showing the effects of varying the energy history size on the onset detection process.

Test Run                                   Beats correctly detected   Erroneous beats detected
Timing test 1, Energy history = 0.25 sec   64                         32
Timing test 1, Energy history = 0.5 sec    64                         15
Timing test 1, Energy history = 0.75 sec   64                         3
Timing test 1, Energy history = 1 sec      64                         25
Timing test 1, Energy history = 2 sec      64                         0
Timing test 1, Energy history = 4 sec      63                         0
Timing test 1, Energy history = 8 sec      59                         0
The results suggest that a history length of around 0.75 seconds produces optimal results – however, as this input is likely not representative of a real input, the three best candidate history lengths (0.5, 0.75 and 1 seconds) were tested with a more complex and realistic input, consisting of a one-bar drum pattern containing each type of drum (not played simultaneously) with a total of 16 hits per bar. The sample was looped four times, giving a total of 64 hits for the test run. A low level of background noise was added to the signal, at approximately 40% of the volume of the background noise added in the previous tests.
Table 3: Showing the effects of varying the energy history size with a more complex input.

Test Run                    Beats correctly detected   Erroneous beats detected
Energy history = 0.5 sec    64                         4
Energy history = 0.75 sec   64                         0
Energy history = 1 sec      61                         0
These results suggest that an energy history size of 0.75 seconds is optimal, as it detects every beat in
the more complex input without any erroneous results, and detects onsets accurately with few false
positives even at high background noise levels. As mentioned earlier, an energy sample size of 1024
samples per short-time energy calculation performs best for the test inputs, as it results in good timing
accuracy, again without detecting any erroneous beats.
4.3 Classification accuracy
The accuracy of the classification algorithm implemented was evaluated by looping an individual drum sound 40 times, and analysing the output to create a confusion matrix. The main variable likely to affect the classification accuracy with the original algorithm is the number of bins to which the FFT is downsampled, and so the FFT size was varied and the results recorded in confusion matrices.
FFT size: 32

            Bass   Snare   Closed HH   Open HH
Bass         21     19         0          0
Snare        11     29         0          0
Closed HH     0      0        21         19
Open HH       7     10         2         21

FFT size: 64

            Bass   Snare   Closed HH   Open HH
Bass         24     16         0          0
Snare        16     24         0          0
Closed HH     3      3        28          6
Open HH       1      0        24         15

FFT size: 128

            Bass   Snare   Closed HH   Open HH
Bass         40      0         0          0
Snare         6     20        12          4
Closed HH     1     11        22          6
Open HH       0      3         9         28
FFT size: 256

            Bass   Snare   Closed HH   Open HH
Bass         29      0        10          0
Snare         4      2         5         29
Closed HH     5      6        24          5
Open HH       1      5         2         32

FFT size: 512

            Bass   Snare   Closed HH   Open HH
Bass          8      6        22          3
Snare        35      3         1          1
Closed HH     4     10        13         13
Open HH      23     12         1          4
As can be seen, increasing the number of bins in the FFT increases the accuracy up to a size of 128 bins, after which the accuracy begins to decrease. This initially seemed like an error, so every result was triple-checked, with the same trends emerging every time. In fact, these results would seem to highlight quite a serious flaw in the implementation of the classification algorithm, the impact of which was unfortunately not realised until it was too late to work around it.
It is believed that the error stems from the fact that the FFT is calculated for the current frame of audio – i.e., a frame 1024 samples long. This means that when an onset is detected, a different region of the audio is used for the FFT each time, as illustrated previously in Figure 4.1. As the frequency components of each sound change over time, different results will be yielded depending upon how much of the start of the sound is processed. For example, consider the case where the onset comes right at the end of the current frame: the FFT will be calculated on this frame, but it will contain hardly any frequency information, as the majority of the onset of the sound falls outside the frame.
An explanation for the trends seen in the classification accuracy as the FFT size is varied, therefore, is as follows. With a small FFT size, many frequency components are grouped together into one bin, and this lack of precision does not allow small differences between sounds with similar spectra in the first 1024 samples to be distinguished. At 128 bins, there is a good amount of frequency information with which to differentiate between quite similar sounds; but on increasing the FFT size further, the extra detail captured begins to highlight more severely the issues caused by different regions of the same drum sound being processed, and so the accuracy decreases again.
Had there been sufficient time, a suitable workaround would likely have been to detect the precise location of the onset within the frame, and then process a given number of samples of audio from that point onwards (which may require waiting for another frame of audio to be captured), meaning that the region of the sound considered for the FFT would be approximately the same no matter where in the frame the onset occurs. It was not clear how to implement such a change to the algorithm quickly, however, and so instead an effort was made to compensate for this phenomenon by capturing more than one 1024-sample frame of audio after detecting an onset before performing the FFT, in the hope that capturing a reasonably large portion of the drum sound regardless of the location of the onset within the frame would suffice.
The FFT size was fixed at 128 frequency bins, as this yielded the most accurate results in the previous
testing, but additional frames of audio were captured after each onset, and these frames were then
joined together and the FFT analysis performed on a longer sample of audio data, hopefully better
representing the frequency spectrum of each drum.
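A sketch of the frame-joining step follows (names illustrative):

using System.Collections.Generic;

// Concatenate the frame containing the onset with the extra frames captured
// after it, so that the FFT always sees roughly the same portion of the
// drum sound regardless of where in the frame the onset fell.
static float[] JoinFrames(List<short[]> frames)   // e.g. 1024 samples each
{
    int frameSize = frames[0].Length;
    float[] joined = new float[frameSize * frames.Count];
    for (int f = 0; f < frames.Count; f++)
        for (int i = 0; i < frameSize; i++)
            joined[f * frameSize + i] = frames[f][i];
    return joined;
}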
Results using 1 extra frame of audio after the onset (i.e. 2048 samples)

            Bass   Snare   Closed HH   Open HH
Bass         40      0         0          0
Snare         0     39         0          1
Closed HH     3      0        36          1
Open HH       0     40         0          0

Results using 2 extra frames of audio after the onset (i.e. 3072 samples)

            Bass   Snare   Closed HH   Open HH
Bass         40      0         0          0
Snare         0     40         0          0
Closed HH     0     19         7         14
Open HH       0     32         1          7
Results using 3 extra frames of audio after the onset (i.e. 4096 samples)

            Bass   Snare   Closed HH   Open HH
Bass         38      0         2          0
Snare         0     40         0          0
Closed HH     0     30        10          0
Open HH       0     27         5          8

Results using 4 extra frames of audio after the onset (i.e. 5120 samples)

            Bass   Snare   Closed HH   Open HH
Bass         37      0         2          1
Snare         0     40         0          0
Closed HH     0     23         3         14
Open HH       9      5         0         26

Results using 6 extra frames of audio after the onset (i.e. 6144 samples)

            Bass   Snare   Closed HH   Open HH
Bass         13      0        26          1
Snare         0     40         0          0
Closed HH     0     40         0          0
Open HH       0      4         2         36
The results from adjusting the number of frames captured after the onset are interesting, as different
sample lengths seem to perform better for different drums. The snare drum was detected correctly in
every instance except one, suggesting that its frequency spectrum is unique enough from the other
three to be easily identifiable. The bass drum was detected correctly 97.5% of the time with up to 4
extra frames of audio being captured, but when increased to 6 extra frames, it was detected correctly
only 32.5% of the time. The closed hi hat was detected very well with one extra frame of audio
(correctly identified 96% of the time), but on increasing the number of extra frames captured, it
became more and more confused with the snare drum. The open hi hat behaved in the opposite way –
with few extra frames, it was confused with the snare, but with 6 extra frames it was correctly
identified 96% of the time.
Taking into account the waveforms of each of the drum sounds used, it would appear that shorter FFT sample times favour short sounds, such as the bass drum and the closed hi hat, whilst longer FFT sample times allow the distinguishing frequency characteristics of the open hi hat to be captured better. This lack of consistency highlights the fact that FFT analysis alone does not appear to provide a sufficiently unique feature set for classification. As the length of the drum sound appears to affect the accuracy, analysing the volume envelope of the sound in addition to the FFT would likely give more accurately classified results (however, time was not found to confirm this). That said, the classification with four extra frames of audio performed very well with the exception of the closed hi hat, classifying 66.2% of the hits (including the closed hi hat) correctly, rising to 85.8% if the closed hi hat is excluded.
The original classification accuracy was quite disappointing, given that the samples used were identical every time. As discussed, this is likely a side effect of not accurately extracting the data from the onset point onwards for the FFT; the experimentation with adjusting the number of additional frames captured suggests that if a reasonably long frame were captured, and the onset located so that only the audio past the onset point was analysed, much greater accuracy with this sample set would be achieved. However, to incorporate more drums into the software, a better set of features would likely need to be extracted in order to sufficiently distinguish between them – examples include the change in frequency components over time [31] or the various statistical analysis techniques discussed in [34]. Additionally, the template matching algorithm used is fairly simple, taking a basic best-fit approach; further research into more suitable methods of selecting the best template, for example the clustering methods discussed in [6], would likely yield improved classification accuracy.
4.4 General observations
In terms of performance, the software runs well. It occupies close to 100% CPU whilst running, most likely due to the unoptimized coding style, but is graceful about allowing other applications to use the CPU. The graphics performance is good, with the exception of occasional slowdown when other applications are also drawing to the screen, for example when the waveform display in Adobe Audition is scrolling.
There is a slight, but noticeable, latency between an onset being detected and being displayed on screen, introduced mostly by the need to match the captured audio frame against the templates, and possibly to capture more than one frame of audio before attempting to classify it. It is for this reason that the time of the onset detection is cached by the TimingAnalysis class before an onset event is actually sent to the UI class – the drum hit will appear in the right place on the grid, but with a slight delay.
Other aspects of the software performed generally well – the pre-defined pattern is displayed without problems, and the timing analysis of the user's playing works as planned, with the individual hits being coloured appropriately on screen and the statistics refreshed with each new hit. The feedback provided by lowering the tempo when the user's accuracy is poor, and offering to raise it again when it improves, is simple but does work, and the tempo is also adjustable with the tempo up/down buttons. As mentioned in 3.8, the timing accuracy of the audio playback is quite variable, and so the metronome tick is not as accurate as it should be.
Whilst all evaluation was done by capturing the output of the soundcard and feeding it back to the
input, using the “Stereo Mixer” input, testing was also performed with a cheap microphone, and it
worked as expected. The poor quality of the microphone meant that the classification accuracy when
playing sounds out of the speakers and through the microphone was very poor, but a better
microphone combined with better classification would address this.
Some features were not implemented which could quite easily have been with time – the lesson
planner user interface could have been made fully functional, which would be an attractive feature of
such a piece of software, and the ability to switch patterns from within the UI could also have been
implemented. However, for a prototype, these were not deemed especially important and so were not
completed.
5 Conclusions
5.1 Satisfaction of requirements
The minimum requirements of the project are restated below, and it is specified whether each
requirement was met.
• The software must have a GUI which allows the user to control the software, and to visualise how well their playing matched the pre-defined pattern – this was implemented, as can be seen in section 3.3.
• The software must be able to play back both the pre-defined pattern and the pattern as played by the user – this was implemented, although some unexpected timing issues meant that the implementation was not perfect (see 3.8).
• The software must use beat tracking techniques to extract the rhythmical pattern played by the user – this was implemented by the onset detection (see 3.5).
• The software must use suitable analysis techniques to analyse the accuracy of the user's playing, and provide useful feedback based upon this – this was implemented, see 3.3 and 3.7.
Therefore, each of the minimum requirements of the project was met. The stated possible extensions
to these requirements were:
• To analyse the input as captured audio, and extract from this the individual drum hits, i.e. separate out bass drums, snares and hi-hats, using real-time filtering – this was implemented, using the drum classification algorithm (see 3.6), although the results were not as good as hoped.
• Functionality to detect when a user is “struggling” with a certain aspect of their playing, and to provide relevant exercises to practice this (e.g. a simple timing exercise) – this was partially implemented, with the functionality to reduce the tempo when the user's accuracy is poor (see 3.7).
• Keep a record of the user's performance over time, and use this information to provide a progress report – this was not implemented, although the implementation would not differ greatly from the code used to output the results to a file for the evaluation.
5.2 Success of the project
I believe that the project was successful in demonstrating concepts which could be expanded upon to create a useful piece of software. Several of the possible extensions specified at the start were implemented, in addition to all of the minimum requirements, and during the development process other features, such as the lesson planner, were introduced to the software, even if they were not completed.
The finished prototype software can capture a one-bar drum pattern using four different drum sounds, quantize it so that small timing inaccuracies are removed, and store it to disk. The stored drum pattern can then be loaded, represented usefully on screen and played back. The software can also capture and analyse a live audio input in real time to detect drum onsets with a good degree of accuracy, and once the system has been calibrated to the four different drum sounds, it attempts to classify every hit according to the closest template drum sound – currently measured to be correct around 66% of the time, although with some (seemingly) quite simple modifications this could likely be improved a good deal.
It also analyses the timing of the input, and provides graphical feedback by displaying the detected drum hits. The most likely candidate drum hit in the original pattern can be found for each hit, and this can be used to calculate the deviation of each hit from the time at which it should have been played, allowing the software to detect when a user's timing is poor and lower the tempo to allow them to practice at a slower speed.
The background research presented a number of ideas on how to tackle the various problems, some of which were implemented in the code and others of which were deemed too complex, and I believe the implementation of the final solution reflects the amount of background research carried out.
It was slightly disappointing to not be able to work further on the classification problem, and resolve
the best way to classify every type of drum individually, but to quote a previous Final Year Project,
also concerned with analysis of audio signals, “with a project like this it is hard to find a point at
which to stop” [28], and I believe that the background research and implementation of the software
presented in this project represents a good prototype.
5.3 Possible improvements to accuracy
The onset detection of the software appears to be suitably accurate for use in a more advanced version
of the system, but the drum classification requires a good deal of work before it is properly useable. A
great improvement in the classification accuracy could likely be achieved by taking into account more
than one FFT reading – for example, FFT readings at regular intervals throughout the drum hit – or by
taking into account the amplitude envelope of the drum hit. Both of these approaches would require
quite substantial changes to the algorithms used in the software, but the increase in accuracy would
hopefully be significant.
A more complex problem to solve is that of identifying more than one drum hit at the same time. The
simple subtraction method experimented with during the development did not yield any useful results,
but when combined with a more advanced set of features for the classification, such an approach may
work better.
5.4 Recommendations for further development and research
The main area into which further research would be valuable is that of identifying drum sounds in real time, and particularly that of identifying multiple sounds being played at the same time. There are several possible approaches – the use of two microphones a distance apart from each other may allow some degree of localization of the source of each sound, allowing specific sounds to be pinpointed; alternatively, further work on analysing the spectra of the drum hits in more detail may identify details which would aid classification.
Further development of the system is possible in many ways. Changes which would be relatively simple to implement include expanding the software to support patterns longer than one bar (perhaps by displaying one bar of the pattern at a time, and moving on to a new bar as appropriate) and time signatures other than 4/4, implementing more useful user feedback, improving the GUI, and completing the lesson planner. The software also does not currently take into account the velocity of each drum hit. With the further research mentioned above, it should be possible to expand the software to capture more than four different drums, and to increase the accuracy of the drum classification to a point where it is correct nearly all of the time.
On a larger scale, the software could be integrated as part of a general drum tuition system, for example as part of an interactive tutorial. Alternatively, the ability to identify onsets in an audio signal and to analyse its frequency content could be adapted to other instruments – a similar system may work well for a tuned instrument such as the guitar, and research into identifying multiple drum sounds played together may aid the automatic recognition of chords.
6 Bibliography
1. Albahari, B., P. Drayton, and T. Neward, C# in a Nutshell. 2nd ed. 2003: O'Reilly Publishing.
2. Allen, J., Short term spectral analysis, synthesis, and modification by discrete Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1977. 25(3): p. 235-238.
3. Ardito, C., et al. Usability of E-learning tools. in Proceedings of the working conference on Advanced visual interfaces. 2004. Gallipoli, Italy.
4. Bello, J.P., et al., A Tutorial on Onset Detection in Music Signals. IEEE Transactions on Speech and Audio Processing, September 2005. 13(5): p. 1035-1047.
5. Bello, J.P., et al., On the Use of Phase and Energy for Musical Onset Detection in the Complex Domain. IEEE Signal Processing Letters, June 2004. 11(6): p. 553-556.
6. Bello, J.P., E. Ravelli, and M. Sandler. Drum sound analysis for the manipulation of rhythm in drum loops. in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. 2006. Toulouse, France.
7. Bencina, R. and P. Burk. PortAudio – an open source cross platform audio API. in Proceedings of the 2001 International Computer Music Conference (ICMC-01). 2001. Havana, Cuba.
8. Blackwell, A. and N. Collins. The programming language as a musical instrument. in Proceedings of the 17th Workshop of the Psychology of Programming Interest Group. 2005. Sussex University.
9. Brown, J.C., Calculation of a constant Q spectral transform. 1988, Media Laboratory, MIT.
10. Collins, N. On onsets on-the-fly: Real-time event segmentation and categorization as a compositional effect. in Proceedings of the First Sound and Music Computing Conference (SMC '04). 2004. IRCAM, Paris.
11. Dawson, M., Beginning C++ Game Programming. 2004: Thomson Course Technology.
12. Drake, C., A. Penel, and E. Bigand, Tapping in time with mechanically and expressively performed music. Music Perception, 2000. 18(1): p. 1-24.
13. Eggermont, J.J., Between sound and perception: reviewing the search for a neural code. Hearing Research, 2001. 157(1-2): p. 1-42.
14. Fry, B. and C. Reas. Processing 1.0 (Beta). [Web page] 2006 [cited 29/04/2006]; Available from: http://processing.org/.
15. Hainsworth, S.W. and M.D. Macleod. Onset detection in musical audio signals. in Proceedings of the International Computer Music Conference. 2003. Singapore.
16. Hallam, S. and L. Rogers, Survey of Local Education Authorities Music Services 2002, DfES, Editor. 2002.
17. Houston, B. Exocortex.DSP - An open source C# Complex Number and FFT library for Microsoft .NET. [Web page] 2003 [cited 29/04/2006]; Available from: http://www.exocortex.org/dsp/.
18. Jeffries, R.E., A. Anderson, and C. Hendrickson, Extreme Programming Installed. 2001: Addison-Wesley Professional.
19. Malloch, J., Beat and Tempo Induction for Music Performance. 2005, McGill University, Montreal, Canada.
20. Microsoft. DirectSound and DirectMusic. [Web page] 2005 [cited 29/04/2006]; Available from: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/DirectSound_and_DirectMusic.asp.
21. Microsoft. DirectX Developer Center. [Web page] 2006 [cited 29/04/2006]; Available from: http://msdn.microsoft.com/directx/.
22. Microsoft. Visual C# 2005 Express Edition. 2006 [cited 29/04/2006]; Available from: http://msdn.microsoft.com/vstudio/express/visualcsharp/.
23. Miller, T., Managed DirectX 9 Graphics and Game Programming. 2004: Sams Publishing.
24. OpenGL.org. OpenGL - The Industry Standard for High Performance Graphics. [Web page] 2006 [cited 29/04/2006]; Available from: http://www.opengl.org/.
25. Oualline, S., Practical C++ Programming. 2nd ed. 2002: O'Reilly Publishing.
26. Patin, F., Beat Detection Algorithms. 2003.
27. Pauws, S. CubyHum: A fully operational Query by Humming System. in Proceedings of the Third International Conference on Music Information Retrieval. 2002. Centre Pompidou, Paris.
28. Quested, G., Audio Interface for Performance Tracking. 2004, School of Computing, University of Leeds.
29. Sanchez, J. and M.P. Canton, Java Programming for Engineers. 2002: CRC Press.
30. Schuld, M. C# Managed DirectX 9 Tutorials. [Web page] 2005 [cited 29/04/2006]; Available from: http://www.thehazymind.com/archives/2005/10/tutorial_8_rendering_surfaces.htm.
31. Sillanpää, J., Drum stroke recognition. 2000, Tampere University of Technology, Finland.
32. Software, B. Jake2. [Web page] 2006 [cited 29/04/2006]; Available from: http://www.bytonic.de/html/jake2.html.
33. Talbot-Smith, M., Sound Engineering Explained. 2002: Focal Press.
34. Tanghe, K., S. Degroeve, and B.D. Baets. An algorithm for detecting and labeling drum events in polyphonic music. in Proceedings of the first Music Information Retrieval Evaluation eXchange (MIREX). 2005. London, United Kingdom.
35. Tong, Z. and C.C.J. Kuo. Hierarchical classification of audio data for archiving and retrieving. in Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing. 1999. Phoenix, Arizona.
36. Unknown. Simple DirectMedia Layer. [Web page] 2006 [cited 29/04/2006]; Available from: http://www.libsdl.org/index.php.
37. Wilson, M., C# Performance: A comparison with C, C++, D and Java, in Windows Developer Network Online Supplement. 2003.
38. Wollrath, A., R. Riggs, and J. Waldo. A Distributed Object Model for the Java System. in Proceedings of the second USENIX Conference on Object-Oriented Technologies. 1996. Toronto, Canada.
Appendix A: Reflection
Originally, my proposed project was a system which used beat tracking to output a MIDI clock signal, in order to synchronise other equipment to the playing of a live drummer. However, after searching the Internet, I discovered that there already existed a (seemingly very good) commercial piece of software, Circular Logic's InTime, and there seemed little point in “re-inventing the wheel” (especially when my work would likely be unable to compare to a commercial implementation), and so this project was suggested to me by my supervisor.
Looking back on the project and the challenges faced, I believe the change of project was a blessing in
disguise – the solution for the original idea would likely have been very limited in scope and
effectiveness. I initially struggled somewhat on deciding how the software should be implemented,
and what it should actually do in order to be useful, but gradually it became clear that the project
could represent, in my opinion, a prototype of what could be quite a useful piece of software.
As would seem to be a common theme on reading through the “Reflections” sections of past projects, the most significant problem encountered was that of time management. I had not worked alone on a piece of work of this scale before, and found it all too easy to underestimate the workload and believe that everything could be done at the last minute – needless to say, this is not the case. I feel my major downfall was in starting both the more advanced stages of the software development and the writing of the report later than I should have. In the first semester, I perused many research papers on related topics, which I felt helped me a great deal in finding inspiration for both the solution and the report. I also mastered the audio input component of the software (after several failed attempts to do so in C++, I decided to use C#), and I believe that I overestimated how much I had achieved up to that point and coasted somewhat, until the requirement to complete the project caught up with me.
However, I am pleased with what I have achieved in this project. Working with audio software has
always been a hobby of mine, but never before have I attempted to actually write any of the software
myself. Learning the ins and outs of dealing with audio signals and using techniques such as the
mighty FFT has been an interesting experience, and I hope to be able to use the knowledge gained to
resurrect a piece of currently abandoned open source DJ software, Autopilot “Hummingbird”, when I
find the time.
Developing the software in C# was a pleasure – I had never used the language (or indeed, paid any attention to it) before starting the project, but found the syntax, the libraries and the documentation provided as part of the .NET framework a pleasure to use. It helped that I had experience in Java (which C# closely resembles), but I feel that even without this, learning C# would not have been too difficult. Particular features of the language which stand out for me are the “foreach” statement, which proved useful for iterating through collections; generics, which provide a sensible way to implement data structures such as queues whilst avoiding the computational cost of “boxing” and “unboxing” objects; and the clear focus on object-oriented development (something which dissuaded me from using C++ – I had quite a hard time understanding how to actually write object-oriented software in that language). In addition to this, Microsoft's free Visual Studio 2005 Express did everything I wanted and more – the powerful Autocomplete feature must have saved thousands of keystrokes!
I feel that the solution developed represents a good prototype in spite of some of the flaws mentioned
in the report, and with further work such as implementing more advanced classification algorithms
and better developing the user interface, a useful piece of software for the intended purpose could be
created – something I was not sure would be the end result when first starting the project.
Overall, the project has been a good learning experience, and I am glad to have developed skills and
acquired knowledge which will hopefully help me work in the areas in which I am interested in the
future. Valuable lessons have also been learnt about time management, development methodologies,
locating peer-reviewed research on topics and report writing, which will undoubtedly come in very
useful.
Appendix B: Informal feedback
I demonstrated the software around one week before the deadline to several friends, and asked for
their comments on both the idea and the implementation. Some of their feedback follows.
Name: Tim Whitehead
Musical experience: Grade 2 drums, plays in a rock band
First of all, having learnt to play the drums yourself, do you think a piece of software such as
the one proposed in the project would be of any use to someone learning them?
I think it’s a great idea, I remember the days of avoiding practicing, and of course if you don’t
practice and you have one lesson a week, you don’t get very far, which is a bit of a waste of money. If
the program was easy enough to use, and accurate, I think it could definitely help encourage students
to practice more often, and getting feedback on your playing as you practice is a great idea.
Bearing in mind that this is a prototype, what do you think of the design of the software?
I don’t know much about computers, so the first thing I noticed was that I would have no idea what to
do! It really needs better help in the program telling you what is happening at each stage and what to
do next. But otherwise, I quite like the idea of seeing your playing along with the beat you are
supposed to be playing and I like the way each beat fades out after you play it. It might be better if
you could look back through what you had played and look for mistakes, and it would be more useful
to real drummers if you could see the beat in proper drum notation.
What about the functionality, and what areas do you think need improvement?
When you showed it to me, it seemed to pick up the wrong drum some of the time, and obviously I
wouldn’t want to use something which didn’t work properly. It seems like a good idea to be able to
record your own drum kit first to train it though. The idea of your teacher being able to design a
lesson plan for each week and for you to go through each bit until you can play it well enough is
good. But definitely, it would have to be accurate about what drum is being played, and it would need
to be able to detect all the drums in a kit and have more complex beats to practice.
Name: Tom Head
Musical experience: Creative music and sound technology student
What do you think of the idea behind the software?
At first I didn’t understand what it was meant to do but when you explained the idea of helping
students practice, it sounds like a great idea, especially if the teacher was able to look back on your
practice record for that week and give you real feedback. I think computers can definitely be helpful
in teaching people to play music.
What about the design and functionality of it?
The design is too minimal, it definitely needs more explanation of what everything on the screen does,
but I know it’s a prototype, so other than that it looks quite neat, for me things like transparency
effects can definitely sell a piece of software! The idea of showing the users playing alongside the real
pattern and colour coding each beat is quite good, I suppose it lets you see at a glance where you
might be going wrong. You mentioned that it was not especially accurate when detecting the drums
and I think this is definitely the main thing that would have to be sorted out – it’s no use if it can’t tell
what you are playing properly! It’s good that you can adjust the tempo of the beat and that you can
have a metronome to play along to or hear the beat as it’s meant to be played.
Name: Ryan Hossaini
Musical experience: Guitar player, no formal tuition
What do you think of the idea behind the software?
It sounds like a good idea, especially if it was part of a program to teach you the drums with example
videos and stuff. I remember trying a program to learn the guitar, but it wasn’t really much better than
a book except for you could hear what you were supposed to be playing, if it was able to tell you how
well you were playing and make you repeat practice exercises until you got it right, it would be really
good for learning I think. The idea of using it to help you practice is good, I can imagine it’d be a bit
of a morale boost if you got through the exercises really quickly and it might encourage you to
practice more.
What about the design and functionality of it?
The design is OK, I definitely think it would be better as part of a tutor program rather than on its
own. It’s good that you can see your playing and if you could get it to always detect the right drum, I
think it would be quite an impressive demonstration of computers helping people learn music.
Appendix C: The pattern file format
Pattern files are stored as XML files, which are easily written and read in C# thanks to the .NET
framework "XmlReader" and "XmlWriter" classes. Each pattern file starts with a standard XML
header (automatically written by the XmlWriter), followed by a <pattern> tag which specifies the
tempo and the number of separate drums contained in the pattern (stored in the rows attribute). This
is hardcoded to four in the current implementation, but it would be quite trivial to identify which
drums are used by the pattern, allowing different combinations of drums to be used. Contained within
this tag are a number of <tick> tags, one for each tick; in this implementation, a tick is hardcoded to
represent a quarter of a beat, but this could be changed to allow more intricate patterns to be stored.
If a specific tick contains one or more drum hits, these are represented by <beat> tags, each
specifying which drum is being hit.
As the format is so simple in this prototype, writing an XML schema for it was not deemed
productive, but below is a simple example of such a pattern file:
<?xml version="1.0" encoding="utf-8"?>
<pattern tempo="130" rows="4">
  <tick id="0">
    <beat drum="0" />
  </tick>
  <tick id="1" />
  <tick id="2" />
  <tick id="3" />
  <tick id="4">
    <beat drum="1" />
  </tick>
  <tick id="5" />
  <tick id="6" />
  <tick id="7">
    <beat drum="0" />
  </tick>
  <tick id="8" />
  <tick id="9">
    <beat drum="0" />
  </tick>
  <tick id="10" />
  <tick id="11">
    <beat drum="1" />
  </tick>
  <tick id="12" />
  <tick id="13" />
  <tick id="14" />
  <tick id="15" />
</pattern>
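
For illustration, the following is a minimal sketch of how such a pattern file might be read using the
.NET XmlReader class. The Pattern and Beat classes shown here are hypothetical simplifications for
the purposes of this example, and do not correspond exactly to the classes used in the actual
implementation:

using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical simplified representation of a single drum hit.
class Beat
{
    public int Tick;   // position in the pattern, in quarter-beat ticks
    public int Drum;   // index of the drum being hit
}

// Hypothetical simplified pattern class demonstrating XmlReader usage.
class Pattern
{
    public int Tempo;
    public int Rows;
    public List<Beat> Beats = new List<Beat>();

    public static Pattern Load(string path)
    {
        Pattern pattern = new Pattern();
        int currentTick = 0;

        using (XmlReader reader = XmlReader.Create(path))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element)
                    continue;

                if (reader.Name == "pattern")
                {
                    // The <pattern> tag carries the tempo and drum count.
                    pattern.Tempo = int.Parse(reader.GetAttribute("tempo"));
                    pattern.Rows = int.Parse(reader.GetAttribute("rows"));
                }
                else if (reader.Name == "tick")
                {
                    // Remember which tick any following <beat> tags belong to.
                    currentTick = int.Parse(reader.GetAttribute("id"));
                }
                else if (reader.Name == "beat")
                {
                    Beat beat = new Beat();
                    beat.Tick = currentTick;
                    beat.Drum = int.Parse(reader.GetAttribute("drum"));
                    pattern.Beats.Add(beat);
                }
            }
        }
        return pattern;
    }
}

Writing a pattern file is symmetrical: an XmlWriter created with XmlWriter.Create emits the standard
XML header via WriteStartDocument, and the <pattern>, <tick> and <beat> tags can then be produced
with WriteStartElement and WriteAttributeString calls.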
Appendix D: Evaluation notes
Timing Tests
Note: any erroneously detected beats were filtered out of the mean and standard deviation
calculations, in order to better assess the timing accuracy differences between sample sizes.
Initially, each timing test was run three times with an energy sample size of 1024 samples and a
one-second energy history, in order to confirm that the results did not vary significantly between
runs; this allowed subsequent tests to involve only one run of each test file.
Test Run               Mean diff. (s)   Std. dev. (s)   Min. diff. (s)   Max. diff. (s)   Correct beats   Erroneous beats
Timing test 1, Run 1   0.012            0.018           0.000            0.061            64              0
Timing test 1, Run 2   0.031            0.008           0.014            0.045            64              0
Timing test 1, Run 3   0.016            0.024           0.000            0.148            64              0
Timing test 2, Run 1   0.018            0.010           0.001            0.039            64              0
Timing test 2, Run 2   0.010            0.006           0.000            0.033            64              0
Timing test 2, Run 3   0.021            0.015           0.001            0.041            64              0
Timing test 3, Run 1   0.012            0.013           0.000            0.013            64              0
Timing test 3, Run 2   0.010            0.011           0.001            0.060            64              0
Timing test 3, Run 3   0.008            0.006           0.001            0.029            64              0
Full version of the results of varying the energy sample size
Test Run                            Mean diff. (s)   Std. dev. (s)   Min. diff. (s)   Max. diff. (s)   Correct beats   Erroneous beats
Timing test 1, Sample size = 2048   0.023            0.015           0.001            0.081            64              0
Timing test 2, Sample size = 2048   0.041            0.043           0.000            0.106            256             0
Timing test 3, Sample size = 2048   0.024            0.016           0.000            0.066            256             0
Timing test 1, Sample size = 512    0.006            0.010           0.000            0.036            64              3
Timing test 2, Sample size = 512    0.010            0.006           0.001            0.024            256             23
Timing test 3, Sample size = 512    0.005            0.008           0.001            0.011            256             12
Timing test 1, Sample size = 256    0.005            0.012           0.000            0.025            64              29
Timing test 1, Sample size = 128    0.004            0.009           0.000            0.019            64              68
The background noise used in the tests of the effects of the energy history size was generated in Adobe Audition, using the noise generation command to
create white noise with an intensity of 15. The envelope of this noise was then varied randomly over the 2 second loop using the volume envelope command.
To better simulate a noisy background environment, TC Works Native Reverb was then applied to the noise (with only the effected "wet" output of the reverb
retained) in order to smooth out the envelope, and the resulting sample was mixed with a reversed copy of itself to avoid any clicks or sudden amplitude
changes at the end of the loop. The resulting noise varied in volume over the length of the loop without containing any sudden changes: an extreme
simulation of the background noise that might be encountered in reality. The volume of the noise mixed with the test signal was determined by using the
default setting of a one second energy history and raising the noise level until a significant number of erroneous beats were detected (approximately 32
additional erroneous beats per 64 input beats).
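
As a rough illustration only (the actual test noise was produced in Adobe Audition as described
above, not in code), white noise with a similar slowly varying envelope could be approximated
along the following lines:

using System;

class NoiseGenerator
{
    // Rough approximation of the test noise: white noise whose volume
    // drifts smoothly between random levels, with no sudden changes.
    static float[] GenerateNoise(int sampleRate, double seconds)
    {
        Random rng = new Random();
        int length = (int)(sampleRate * seconds);
        float[] samples = new float[length];

        // Pick a new random target level roughly every 100 ms and
        // interpolate linearly towards it, mimicking the smoothed
        // volume envelope described above.
        int segment = sampleRate / 10;
        double level = rng.NextDouble();
        double target = rng.NextDouble();

        for (int i = 0; i < length; i++)
        {
            if (i > 0 && i % segment == 0)
            {
                level = target;
                target = rng.NextDouble();
            }
            double t = (double)(i % segment) / segment;
            double envelope = level + (target - level) * t;

            // White noise sample in [-1, 1], scaled by the envelope.
            samples[i] = (float)((rng.NextDouble() * 2.0 - 1.0) * envelope);
        }
        return samples;
    }
}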