Department of Signal Processing and Acoustics
Laura Laaksonen
Artificial bandwidth extension of narrowband speech - enhanced speech quality and intelligibility in mobile devices
Aalto University DOCTORAL DISSERTATIONS

Aalto University publication series DOCTORAL DISSERTATIONS 64/2013

Artificial bandwidth extension of narrowband speech - enhanced speech quality and intelligibility in mobile devices

Laura Laaksonen

A doctoral dissertation completed for the degree of Doctor of Science in Technology to be defended, with the permission of the Aalto University School of Electrical Engineering, at a public examination held in the lecture hall S1 of the school on May 3, 2013, at 12 noon.

Aalto University
School of Electrical Engineering
Department of Signal Processing and Acoustics

Supervising professor
Professor Paavo Alku

Preliminary examiners
Professor Gernot Kubin, Graz University of Technology, Austria
Professor Yannis Stylianou, University of Crete, Greece

Opponent
Professor Bayya Yegnanarayana, International Institute of Information Technology (IIIT), Hyderabad, India

Aalto University publication series DOCTORAL DISSERTATIONS 64/2013
© Laura Laaksonen
ISBN 978-952-60-5124-6 (printed)
ISBN 978-952-60-5125-3 (pdf)
ISSN-L 1799-4934
ISSN 1799-4934 (printed)
ISSN 1799-4942 (pdf)
http://urn.fi/URN:ISBN:978-952-60-5125-3
Unigrafia Oy
Helsinki 2013
Finland

Abstract
Aalto University, P.O. Box 11000, FI-00076 Aalto www.aalto.fi

Author Laura Laaksonen
Name of the doctoral dissertation Artificial bandwidth extension of narrowband speech - enhanced speech quality and intelligibility in mobile devices
Publisher School of Electrical Engineering
Unit Department of Signal Processing and Acoustics
Series Aalto University publication series DOCTORAL DISSERTATIONS 64/2013
Field of research Acoustics and audio signal processing
Manuscript submitted 27 September 2012
Date of the defence 3 May 2013
Permission to publish granted (date) 10 January 2013
Language English
Article dissertation (summary + original articles)

Abstract
Even today, most telephone users are offered only narrowband speech transmission. The limited frequency band from 300 Hz to 3400 Hz reduces both the quality and the intelligibility of speech because of the missing high-frequency components, which are important cues especially in consonant sounds. Particularly in mobile communication, which often takes place in noisy environments, degraded speech intelligibility results in listener fatigue and difficulty in speaker recognition. The deployment of wideband (50–7000 Hz) and superwideband (50–14000 Hz) speech transmission is ongoing, but the current narrowband speech coding will coexist with the new technologies for years to come.

In this thesis, a speech enhancement method called artificial bandwidth extension (ABE) for narrowband speech is studied. ABE methods aim to improve the quality and intelligibility of narrowband speech by regenerating the missing high-frequency content of the speech signal, typically in the frequency range 4 kHz–8 kHz. Since the enhanced speech quality is achieved without any transmitted information, the algorithm can be implemented at the receiving end of a communication link, for example in a mobile device after decoding the speech signal.
This thesis presents algorithms for artificially extending the speech bandwidth. The methods are primarily designed for monaural speech signals, but the extension of binaural speech signals is also addressed. The algorithms are developed such that they incur reasonable computational costs, memory consumption, and algorithmic delays for mobile communications. These and other implementation issues related to mobile devices are addressed here. The performance of the methods has been evaluated by several subjective tests, including listening-opinion tests in several languages, intelligibility tests, and conversational tests. The evaluations have mostly been carried out with coded speech to provide realistic results. The results of the subjective evaluations show that artificial bandwidth extension can improve the quality and intelligibility of narrowband speech signals in mobile communications. Further evidence of the reliability of the methods has been obtained from successful product implementations.

Keywords speech processing, speech enhancement, artificial bandwidth extension, speech quality, mobile device
ISBN (printed) 978-952-60-5124-6
ISBN (pdf) 978-952-60-5125-3
ISSN-L 1799-4934
ISSN (printed) 1799-4934
ISSN (pdf) 1799-4942
Location of publisher Espoo
Pages 170
Location of printing Helsinki
Year 2013
urn http://urn.fi/URN:ISBN:978-952-60-5125-3

Tiivistelmä (Abstract in Finnish)
Aalto University, P.O. Box 11000, FI-00076 Aalto www.aalto.fi

Author Laura Laaksonen
Name of the doctoral dissertation Puheen keinotekoinen kaistanlaajennus - parempilaatuista ja ymmärrettävämpää puhetta matkapuhelimiin (Artificial bandwidth extension of speech - better-quality and more intelligible speech for mobile phones)
Publisher School of Electrical Engineering
Unit Department of Signal Processing and Acoustics
Series Aalto University publication series DOCTORAL DISSERTATIONS 64/2013
Field of research Acoustics and audio signal processing
Manuscript submitted 27 September 2012
Permission to publish granted (date) 10 January 2013
Date of the defence 3 May 2013
Language English
Article dissertation (summary + original articles)

Abstract
Even today, most telephone traffic is narrowband: only the 300–3400 Hz frequency band of the speech signal is transmitted. The limited frequency band degrades both the quality and the intelligibility of speech, because the high-frequency acoustic cues, which are particularly important for consonant sounds, are missing from the signal. Especially in noisy environments, the poor intelligibility of mobile-phone speech fatigues users and makes it harder to recognize the speaker. Although the deployment of wideband (50–7000 Hz) speech transmission has begun, narrowband transmission methods will remain in use alongside the new technologies for years.

This dissertation studies artificial bandwidth extension of narrowband speech signals. This speech enhancement method aims to improve speech quality and intelligibility by adding content to the missing frequencies of the speech signal, for example to the 4–8 kHz band. Since no information about the original content of the missing band is transmitted, the extension can be implemented at the receiving end of the telephone connection, for example in the recipient's mobile phone after the speech signal has been decoded.

This work presents artificial bandwidth extension algorithms. The algorithms are designed primarily for monaural signals, but the extension of binaural signals has also been studied. The algorithm development took into account the computational, memory, and algorithmic-delay constraints of the mobile-phone environment. These and other productization issues related to the method are also discussed in this work.
The quality of the bandwidth extension methods has been evaluated with several subjective tests, such as listening tests conducted in different languages and conversational tests. These evaluations have mainly used coded speech material so that the results would be as realistic as possible. The results show that artificial bandwidth extension can improve the quality and intelligibility of narrowband speech in the mobile-phone environment. This finding is also supported by successful mobile-phone implementations of the algorithm.

Keywords speech processing, speech enhancement, artificial bandwidth extension, speech quality, mobile phone
ISBN (printed) 978-952-60-5124-6
ISBN (pdf) 978-952-60-5125-3
ISSN-L 1799-4934
ISSN (printed) 1799-4934
ISSN (pdf) 1799-4942
Location of publisher Espoo
Pages 170
Location of printing Helsinki
Year 2013
urn http://urn.fi/URN:ISBN:978-952-60-5125-3

Preface

This thesis is the story of a long-term research collaboration between Aalto University, Department of Signal Processing and Acoustics, and Nokia. The collaboration in the field of artificial bandwidth extension (ABE) of telephone speech signals started in 1999, and I have had the opportunity to be part of it since 2002. The story started from an idea for an ABE algorithm. Over the years, several ABE algorithms were developed and evaluated in subjective tests. The next step was to implement them in mobile products and to evaluate the algorithms in a realistic conversational context. I have enjoyed this project and learned a great deal about speech processing technology, scientific research work, and speech quality, thanks to many supportive and delightful people from both the university and Nokia.

First of all, I would like to thank Prof. Paavo Alku for giving me the opportunity to work on ABE in the first place. He hired me to the Laboratory of Acoustics and Audio Signal Processing to work on ABE and to write an M.Sc. thesis. In 2003 I started my career at Nokia, and the same year I came up with the idea to start PhD studies. Fortunately, Paavo welcomed me as one of his PhD students. Paavo, without your support and encouragement this thesis would not have been finished. I appreciate how you always find time to review, comment on, and discuss research work, no matter how busy you are. Another important person behind the ABE collaboration project is Jari Sjöberg, the leader of the Audio Algorithms team at Nokia. Thank you for letting me concentrate on the ABE research and the writing of this thesis. I look forward to returning to work in September.

All the publications of my thesis have been written together with talented people, and I wish to thank all my co-authors. From Juho Kontio I learned a lot about neural networks. Thanks to Hannu Pulakka; it has been a pleasure working with you during these years. In addition, thanks to Martti Vainio, Jouni Pohjalainen, and Santeri Yrttiaho for your help and participation in the publications. I am grateful to the pre-examiners, Prof. Gernot Kubin and Prof. Yannis Stylianou, for their dedicated work and valuable comments on the manuscript. I would also like to thank Luis Costa for proof-reading the manuscript.

Furthermore, I would like to express my gratitude to the Nokia Audio Algorithms team in Tampere. Many of my former and current colleagues have influenced this work. Päivi Valve, from you I learned that it is possible to find a solution to every problem, one way or another. Ville Myllylä, I appreciate our discussions and your innovative ideas.
Riitta Niemistö, I remember you once said, that you believe I can finish my dissertation some day. I have kept that in my mind during the times when I wasn’t quite sure myself. Jukka Vartiainen, thanks for your help in many signal processing questions. In addition, I would like to thank Matti Kajala, Erkki Paajanen, Antti Pasanen, Anu Lahdenpohja, Jouko Salo, and Eero Niemelä for your participation in ABE work. I also wish to thank Anssi Rämö and Henri Toukomaa for listening test arrangements. In Nokia Helsinki site, I have met many great audio people. I would like to thank you for inspiring lunch breaks, parties, and discussions during the years. Especially, I would like to thank my friends Riitta Väänänen, Julia Turku, Jussi Virolainen and Jarmo Hiipakka. I would like to thank many acoustics people from Otaniemi. Conference trips to Philadelphia, Lisbon, and Florence were so much fun because of you. I remember fun dinners in Philadelphia. In Lisbon, the fado concert was amazing. And the bus trip across the Europe in the middle of the night was unforgettable. It’s no wonder I don’t really remember the discussions on my poster after travelling (and being 7 months pregnant) from Frankfurt to Florence by bus because the flights to Italy were cancelled. Finally, I would like to express my gratitude to my family and friends. Thank you Anna for your friendship. Miika, thank you for everything, for being so supportive and positive during this project, and for photoshooting the cover picture for my thesis on a freezing cold winter day. Ellen (6 years), Lotta (4 years) and Lauri (1 year), you are the world to me. I would also like to thank my parents, Elina and Jorma, for all the encouragement and support. Thanks to my sister Eeva, and my brother Eero, and their families for all the good times and laughs. And last but not least, thanks to Merja and Pekka for all the help. Finalizing the thesis while staying at home with three children would not have been possible without the help from my whole family. Espoo, March 22, 2013, Laura Laaksonen 2 Contents Preface 1 Contents 3 List of publications 5 Author’s contribution 7 List of abbreviations 9 List of figures 13 1. Introduction 15 1.1 Aim of the study . . . . . . . . . . . . . . . . . . . . . . . . . . 2. Speech and hearing 17 19 2.1 Speech production . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.2 Signal characteristics of speech sounds . . . . . . . . . . . . 21 2.2.1 Voiced sounds . . . . . . . . . . . . . . . . . . . . . . . 21 2.2.2 Unvoiced sounds . . . . . . . . . . . . . . . . . . . . . 23 2.2.3 Plosives . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.3 Sounds of speech . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.4 Source filter model . . . . . . . . . . . . . . . . . . . . . . . . 25 2.4.1 Linear prediction . . . . . . . . . . . . . . . . . . . . . 25 2.5 Hearing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.5.1 Binaural hearing and localization . . . . . . . . . . . 3. Digital speech transmission 29 33 3.1 Speech coding . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.2 Pulse code modulation . . . . . . . . . . . . . . . . . . . . . . 35 3.3 Narrowband speech in cellular networks . . . . . . . . . . . . 36 3 Contents 3.4 Wideband and beyond . . . . . . . . . . . . . . . . . . . . . . 4. Speech quality and intelligibility 37 39 4.1 Subjective quality evaluation . . . . . . . . . . . . . . . . . . 40 4.1.1 Listening-only tests . . . . . . . . . . . . . . . . . . . . 
41 4.1.2 Conversational tests . . . . . . . . . . . . . . . . . . . 43 4.1.3 Field tests . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.2 Objective quality evaluation . . . . . . . . . . . . . . . . . . . 44 4.3 Intelligibility tests . . . . . . . . . . . . . . . . . . . . . . . . . 45 5. Artificial bandwidth extension of speech 5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 47 5.1.1 Correlation between narrowband signal and the missing highband . . . . . . . . . . . . . . . . . . . . . . . . 48 5.2 General model for artificial bandwidth extension . . . . . . . 49 5.3 Extension of the excitation signal . . . . . . . . . . . . . . . . 51 5.4 Extension of the spectral envelope . . . . . . . . . . . . . . . 53 5.4.1 Features . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.4.2 Distance measures . . . . . . . . . . . . . . . . . . . . 55 5.4.3 Codebook mapping . . . . . . . . . . . . . . . . . . . . 56 5.4.4 Linear mapping . . . . . . . . . . . . . . . . . . . . . . 57 5.4.5 Gaussian mixture model . . . . . . . . . . . . . . . . . 58 5.4.6 Hidden markov model . . . . . . . . . . . . . . . . . . 59 5.4.7 Neural networks 60 . . . . . . . . . . . . . . . . . . . . . 6. Artificial bandwidth extension in mobile devices 65 6.1 Artificial bandwidth extension in a mobile device . . . . . . . 65 6.1.1 Signal path in mobile telephony . . . . . . . . . . . . . 68 6.1.2 Acoustic design of a mobile terminal . . . . . . . . . . 69 6.2 Artificial bandwidth extension in car telephony . . . . . . . . 70 7. Summary of the publications 71 8. Conclusions 77 Bibliography 81 Errata 91 Publications 93 4 List of publications This thesis consists of an overview and of the following publications which are referred to in the text by their Roman numerals. I Laura Laaksonen, Juho Kontio, and Paavo Alku. Artificial bandwidth expansion method to improve intelligibility and quality of AMR-coded narrowband speech. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (ICASSP), volume 1, pages 809–812, March 2005. II Juho Kontio, Laura Laaksonen, and Paavo Alku. Neural networkbased artificial bandwidth expansion of speech. IEEE Transactions on Audio, Speech, and Language Processing, volume 15, issue 3, pages 873– 881, March 2007. III Hannu Pulakka, Laura Laaksonen, Martti Vainio, Jouni Pohjalainen, and Paavo Alku. Evaluation of an artificial speech bandwidth extension method in three languages. IEEE Transactions on Audio, Speech, and Language Processing, volume 16, issue 6, pages 1124–1137, August 2008. IV Laura Laaksonen, Hannu Pulakka, Ville Myllylä, and Paavo Alku. Development, evaluation and implementation of an artificial bandwidth extension method of telephone speech in mobile terminal. IEEE Transactions on Consumer Electronics, volume 55, issue 2, pages 780–787, May 2009. 5 List of publications V Laura Laaksonen and Jussi Virolainen. Binaural artificial bandwidth extension (B-ABE) for speech. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (ICASSP), volume 1, pages 4009–4012, April 2009. VI Laura Laaksonen, Ville Myllylä, and Riitta Niemistö. Evaluating artificial bandwidth extension by conversational tests in car using mobile devices with integrated hands-free functionality. In Proceedings of the 12th Annual Conference of the International Speech Communication Association, Interspeech, pages 1177–1180, August 2011. 
VII Hannu Pulakka, Laura Laaksonen, Santeri Yrttiaho, Ville Myllylä, and Paavo Alku. Conversational quality evaluation of artificial bandwidth extension of telephone speech. The Journal of the Acoustical Society of America, volume 132, issue 2, pages 848–861, August 2012. 6 Author’s contribution Publication I: “Artificial bandwidth expansion method to improve intelligibility and quality of AMR-coded narrowband speech” The author was the main developer of the algorithm. The author planned the quality and intelligibility tests together with the co-authors. She also processed the samples for the tests. The author primarily wrote the article. Publication II: “Neural network-based artificial bandwidth expansion of speech” The author planned the subjective test and sample processing together with the co-authors. She performed the objective analysis and wrote a considerable part of the article. The author also participated in the development of the algorithm. Publication III: “Evaluation of an artificial speech bandwidth extension method in three languages” The author was the main developer of the algorithm. She planned the subjective listening test together with the co-authors and processed the speech samples for the test. The author wrote the algorithm description in section II for the article. 7 Author’s contribution Publication IV: “Development, evaluation and implementation of an artificial bandwidth extension method of telephone speech in mobile terminal” The author was the main developer of the algorithm. She had a significant role in planning and organizing the listening tests. In addition, the author primarily wrote the article except for some parts in sections III.C and IV.B. Publication V: “Binaural artificial bandwidth extension (B-ABE) for speech” The author developed the algorithm and planned the subjective tests together with the co-author. She conducted the listening tests and analysed the results. The author wrote most of the article. Publication VI: “Evaluating artificial bandwidth extension by conversational tests in car using mobile devices with integrated hands-free functionality” The author planned, designed, and conducted the conversational tests with the co-authors. She primarily wrote the article except for section 2.1. Publication VII: “Conversational quality evaluation of artificial bandwidth extension of telephone speech” The author planned and designed the conversational tests together with the co-authors. She also participated in piloting the tests. She is the main developer of the method ABE1 and implemented the algorithm for the test. She wrote sections I, II A, and parts of III for the article. 
8 List of abbreviations 2G second generation 3G third generation 3GPP 3rd Generation Partnership Project ABE artificial bandwidth extension ACELP algebraic code excited linear prediction ACR absolute category rating ADPCM adaptive differential pulse code modulation AMPS Advanced Mobile Phone System AMR adaptive multirate AMR-WB adaptive multirate wideband B-ABE binaural artificial bandwidth extension CCR comparison category rating CEPT European Conference of Postal and Telecommunications Administrations CMOS comparison mean opinion score CS-ACELP conjugate structure algebraic code-excited linear prediction DCR degradation category rating DFT discrete Fourier transform DMOS degradation mean opinion score 9 List of abbreviations DRT diagnostic rhyme test EFR enhanced full rate EM expectation maximization ETSI European Telecommunications Standards Institute FFNN feedforward neural network FFT fast Fourier transform GMM Gaussian mixture model GSM Global System for Mobile Communications HD high definition HMM hidden Markov model HRTF head related transfer function ILD interaural level difference IMS internet protocol multimedia system ITD interaural time difference ITU-T International Telecommunication Union – Telecommunication Standardization Sector LD-CELP low-delay code excited linear prediction log-PCM logarithmic pulse code modulation LP linear prediction LPC linear predictive coding LSD log spectral distortion LSF line spectrum frequency, LSP line spectrum pair LTE Long Term Evolution MFCC mel frequency cepstral coefficient MI mutual information MIPS millions of instructions per second 10 List of abbreviations MLP multilayer perceptron MMSE minimum mean square error MOS mean opinion score MRT modified rhyme test NEABE neuroevolution artificial bandwidth extension NEAT neuroevolution of augmenting topologies NMT Nordic Mobile Telephone PCM pulse code modulation PDF probability density function PESQ perceptual evaluation of speech quality POLQA perceptual objective listening quality analysis PSTN public switched telephone network RPE-LTP regular pulse excitation with long term prediction SD spectral distortion SNR signal-to-noise ratio SRT speech reception threshold TFO tandem-free operation TrFO transcoder-free operation UMTS Universal Mobile Telecommunications System VoIP voice over internet protocol VoLTE voice over Long Term Evolution VSELP vector sum excited linear prediction WCDMA wideband code division multiple access 11 List of abbreviations 12 List of figures 2.1 Organs involved in speech production. . . . . . . . . . . . . . 20 2.2 Simplified glottal pulse waveform. . . . . . . . . . . . . . . . 21 2.3 Typical voiced speech sound presented as a time domain waveform, and its amplitude spectrum. . . . . . . . . . . . . 22 2.4 Typical unvoiced speech sound presented as a time domain waveform, and its amplitude spectrum. . . . . . . . . . . . . 23 2.5 Typical plosive presented as a time domain waveform, and its amplitude spectrum. . . . . . . . . . . . . . . . . . . . . . 24 2.6 Block diagram of a source filter model of speech production. 25 2.7 Vowel windowed with a rectangular and a Hann window. . . 27 2.8 LPC residual signal for a vowel. . . . . . . . . . . . . . . . . . 27 2.9 Amplitude spectrum and an LPC spectrum of a vowel. . . . 28 2.10 Schematic illustration of the human ear. . . . . . . . . . . . . 29 2.11 Coordinate system used to describe the position of a sound source. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
30 2.12 Binaural sound production through headphones. . . . . . . . 31 3.1 Average speech spectrum. . . . . . . . . . . . . . . . . . . . . 34 5.1 General model for ABE. . . . . . . . . . . . . . . . . . . . . . 50 5.2 ABE based on LPC. . . . . . . . . . . . . . . . . . . . . . . . . 50 5.3 Extension of the excitation by cosine modulation. . . . . . . 51 5.4 HMM with five states. . . . . . . . . . . . . . . . . . . . . . . 59 5.5 Feedforward neural network. . . . . . . . . . . . . . . . . . . 61 5.6 Schematic diagram of neuroevolution methods. . . . . . . . . 62 6.1 Different phone call scenarios between narrowband and wideband mobile devices. . . . . . . . . . . . . . . . . . . . . . . . 67 6.2 Receiving frequency mask. . . . . . . . . . . . . . . . . . . . . 69 13 List of figures 7.1 Main results of the CCR language test. . . . . . . . . . . . . 72 7.2 Signal path from the far-end user to the near-end user in a telephone conversation between two mobile phone users. . . 73 7.3 Teleconferencing system including a conference bridge and 14 a terminal with B-ABE function. . . . . . . . . . . . . . . . . 74 7.4 Mobile devices in the test car. . . . . . . . . . . . . . . . . . . 75 7.5 Schematic illustration of the conversation test setup. . . . . 75 1. Introduction Speech signals in telephone communications have been bandlimited since the beginning of the history of the telephone. During the days of analogue telephone, the limited bandwidth was due to the physical restrictions of acoustic components and the bandwidth capacity. A digital transmission utilizing pulse code modulation (PCM) adopted a 8 kHz sampling rate and the speech bandwidth of 300–3400 Hz both for compatibility with the analogue telephone and also for reasons of bandwidth capacity [1]. For decades, consumers were offered only narrowband (also called telephone band) speech transmission. The telephone users got used to the telephone speech that sounds muffled and has reduced speech quality [2] and intelligibility [3], especially during consonants, due to the missing important high frequency acoustic cues. The narrowband PCM quality may have been adequate for landline telephony in the 20th century, but mobile communications has brought new challenges and demands for speech transmission. In the 1990’s, the number of mobile phones started to increase rapidly. The first coders designed for the 2G mobile networks suffered from degraded speech quality compared to the narrowband PCM. Later, the enhanced full rate (EFR) [4] and adaptive multirate (AMR) [5] codecs reached nearly the narrowband PCM quality. The first significant improvement to the speech quality and intelligibility was achieved by increasing the speech bandwidth and the sampling rate of a speech codec. The adaptive multirate wideband (AMRWB) speech codec with a frequency band of 50–7000 Hz was standardized in 2001 [6], and its deployment started in 2009 [7]. Still today, only a small portion of end users in mobile telephony are offered wideband transmission that is marketed as high definition (HD) voice. The upgrade of networks and mobile devices for AMR-WB support is time consuming. On the other hand, in voice over internet protocol (VoIP) applications wideband 15 Introduction or superwideband speech is often supported, for example in [8]. Speech transmission in mobile networks is characterized by the fact that mobile phones can be used everywhere. Background noise conditions may vary from quiet to extremely noisy and complex acoustic surroundings. 
From a speech quality perspective, the small mechanical components, i.e. earpieces and microphones, as well as the variety of possible bluetooth, car and other accessories that are used with the mobile devices are also a challenge. To face these challenges, a proper acoustical design of the device and speech enhancement methods that modify the speech signal at both ends of the telephone link are needed. Speech enhancement algorithms aim to improve the quality and intelligibility of speech, for example, by reducing noise and echo from the signal or by emphasizing perceptually important parts of the signal. Noise cancellation and single-channel and multichannel noise reduction techniques are examples of such algorithms that are important in mobile communications. These methods are applied in modern mobile devices and networks. Motivated by the slow deployment of wideband speech, one of the speech enhancement research topics since the mid 1990’s has been artificial bandwidth extension (ABE). Narrowband speech transmission and coding uses a sampling rate of 8 kHz that restricts the speech bandwidth to 300– 3400 Hz. ABE methods aim to improve quality and intelligibility by regenerating the missing high frequency content of a speech signal in the receiving end of the transmission. An ABE method increases the sampling rate, typically from 8 kHz to 16 kHz, and adds new frequency components to the highband, i.e. typically a frequency range of 4–8 kHz. The extension is completely artificial, indicating that no information related to the missing highband is transmitted. However, there also are methods that are not completely artificial but utilize transmitted side information in the extension procedure [9, 10, 11]. ABE can be seen as a speech enhancement method for narrowband speech signals that improves quality and intelligibility. Especially in noisy environments, the wider bandwidth is beneficial. On the other hand, ABE can be seen as an algorithm that transforms a narrowband signal to wideband in the receiving terminal when wideband transmission is not available. During the transition phase from narrowband to wideband speech transmission, the speech bandwidth may vary between and even during phone calls, depending on the available network conditions and telephone devices. The challenge in ABE methods is to generate as natural sound- 16 Introduction ing wideband speech as possible. While pursuing this goal, some unnatural highband artefacts might be created in the signal that might annoy listeners. As for all speech enhancement methods, artificial bandwidth extension should be as transparent to end users as possible. Subjective speech quality assessment is extremely important in the field of ABE research. Both underestimation and overestimation of the signal level in the artificial highband is likely to produce an audible artefact. Especially fricatives and plosives, which are characterized by a burst of frication, are extremely sensitive to unnatural signal components in the highband, because they have a considerable amount of energy in frequencies above 4 kHz. For implementing an ABE method in a mobile device, thorough testing and evaluation of the method is needed. The interoperability over the whole signal path, including other speech enhancements and acoustical properties of the device, has to be taken into account in the implementation of the ABE feature. 1.1 Aim of the study This thesis studies artificial bandwidth extension methods for narrowband speech signals. 
Most of the research work has been carried out within a collaborative research project between Nokia and the Department of Signal Processing and Acoustics of Aalto University. The project on artificial bandwidth extension started in 1999, but the work related to this thesis was conducted during the years 2003–2011 at Nokia. The thesis addresses four main research topics:

1. Development of ABE algorithms for narrowband telephone speech that are robust with respect to noisy and distorted (due to speech coding) input signals
2. ABE of binaural speech signals
3. Comprehensive evaluation of ABE methods by subjective listening-only tests and conversational tests
4. Implementation of an ABE method in a mobile device

The algorithm development includes new algorithms that improve the speech quality and intelligibility of narrowband telephone speech. These methods are presented in Publication I, Publication II, Publication III and Publication VII. Furthermore, the extension of binaural signals is addressed in Publication V. The designed methods are evaluated by several listening-opinion tests, including, for example, listening tests addressing the potential language dependency of the algorithm (Publication III) and conversational tests (Publication VI and Publication VII). Finally, the implementation of an ABE method in a mobile device is discussed in Publication IV.

2. Speech and hearing

2.1 Speech production

In everyday life we say that we speak with our mouths. In fact, the mouth is needed for speech communication, but by itself it is not at all adequate for speech production. Speech production actually starts from the lungs, where the lung pressure is increased by drawing air into the lungs. While maintaining some pressure in the lungs, air is forced from the lungs and passed through the trachea, the larynx, and the vocal tract. The organs involved in speech production are shown in the schematic illustration of figure 2.1.

Figure 2.1. Organs involved in speech production.

The larynx is a sophisticated organ that controls the air flow from the lungs. In the larynx, two horizontal ligaments called vocal folds are attached posteriorly to the arytenoid cartilages that, in turn, are used to control the size of the opening between the vocal folds. This opening is called the glottis. The vocal tract is an acoustic tube that starts from the larynx, ends at the lips, and consists of the pharyngeal cavity and the oral cavity. The total length of the vocal tract from the larynx to the lips is about 17 cm for an adult male and 13.5 cm for a female [12]. By changing the length and the cross-section profile of the vocal tract, mostly by moving the lips, jaw, tongue, and velum, humans are able to produce different speech sounds. The influence of the vocal tract on the speech sound is called articulation.

The velum separates an ancillary path, the nasal tract, from the vocal tract for sound transmission. It starts from the velum and ends at the nostrils. During nasal speech sounds this path is opened, whereas during non-nasal sounds the velum is tightly drawn up.

There are three main ways in which humans use the speech production organs to produce different speech sounds [13].
These three ways result in the speech sound categories: 1) voiced sounds, 2) unvoiced sounds, and 3) plosives (strictly speaking, in the classification of [13], plosives can be either voiced or unvoiced).

Voiced sounds refer to quasi-periodic sounds, such as vowels and nasals, during which the air from the lungs travels through the larynx, where the two vocal folds start to open due to the increased air pressure. After a complete opening, the vocal folds start to close due to the Bernoulli effect until they are completely closed. The resulting quasi-periodic signal is the source signal for voiced sounds and is called a glottal pulse. The production of a sound in this manner is called phonation [14]. A simplified glottal pulse waveform is shown in figure 2.2. The periodicity originates from the vibrating vocal folds that open and close regularly. The vocal folds are completely open at the maximum amplitude of the glottal pulse waveform and closed at the minimum. The round shape of the glottal pulse can be explained by the watery tissue of the vocal folds, and this round waveform shape results in low-pass characteristics in the frequency domain. The fundamental frequency, f0, and consequently the perceived pitch of speech are determined by the vibration rate of the vocal folds. For females, f0 is typically about 200 Hz and for males about 120 Hz. In the frequency domain, the spectrum of the glottal pulse has a comb-shaped structure that shows the fundamental frequency and its harmonics.

Figure 2.2. Simplified glottal pulse waveform.

During the production of voiced sounds the vocal tract is either completely or partly open. The vocal tract acts as an acoustic filter that creates resonances called formants. Since the parameters of the vocal tract as an acoustical filter are continuous and distributed over the entire tract, the resulting transfer function depends on the overall length, shape, and volume of the vocal tract rather than on just a single parameter.

During unvoiced sounds, which are noise-like sounds such as fricatives, the vocal folds are almost completely open, and no periodic glottal pulse is created. Instead, the source signal is noise generated by a turbulent air flow through a constriction in the vocal tract. The constriction is formed, for example, by the tongue behind the teeth. The noise source is further modified by the resonances of the vocal tract and radiated from the mouth [14].

During plosives, the vocal tract is completely closed at some place, for example by the lips, and the air flow is blocked. When the obstruction is suddenly opened, the released airflow from the lungs produces a sudden impulse in pressure, causing a short, audible sound with a noise-like waveform.

2.2 Signal characteristics of speech sounds

2.2.1 Voiced sounds

Western languages comprise mostly voiced sounds. For example, about 78 % of speech sounds in standard English have been reported to be voiced [14]. A typical voiced speech sound is shown in figure 2.3 both as a time-domain waveform and as a frequency-domain spectrum. The waveform is characterized by a periodic structure and a large variation in amplitude.

Figure 2.3. Typical voiced speech sound (Finnish [a]) presented as a time domain waveform (top), and its amplitude spectrum (bottom). The spectrum has been computed with a 1024-point FFT using a Hann window.
The amplitude spectrum shows the clear harmonic structure, especially at low frequencies. The first harmonic corresponds to the fundamental frequency. The formant frequencies can also be identified from the maximum peaks of the spectral envelope. The spectrum of voiced sounds typically has low-pass characteristics, which originate from the excitation signal, i.e. the glottal pulse. In a narrowband signal, due to the low-pass characteristics, a great deal of the energy of a voiced sound is preserved despite the restricted bandwidth. In addition, a narrowband signal also contains the most important harmonics. Although the fundamental frequency may be missing, the human ear is still able to hear the pitch correctly, a phenomenon called the missing fundamental. From the bandwidth extension point of view, the most important aim is to extend the low-pass envelope to the higher frequencies. This low-pass characteristic originating from the glottal pulse is also one of the justifications for the correlation between the narrowband speech signal and the missing high frequencies. On the other hand, the exact regeneration of the harmonic structure at higher frequencies is not as perceptually important [15].

2.2.2 Unvoiced sounds

A waveform of a typical unvoiced speech sound is shown in figure 2.4. The small amplitude values and rapid changes of direction in the temporal signal waveform are due to the noisy source signal. The amplitude spectrum in the lower part of figure 2.4 increases with frequency, indicating that unvoiced sounds are characterized by high-frequency components.

Figure 2.4. Typical unvoiced speech sound (Finnish [s]) presented as a time domain waveform (top), and its amplitude spectrum (bottom). The spectrum has been computed with a 1024-point FFT using a Hann window.

It is evident that a large portion of the energy of fricative sounds is missing from narrowband signals. These speech sounds are especially challenging for ABE methods, since natural-sounding wideband fricatives are obtained only if an adequate amount of energy is added to the higher frequencies. On the other hand, misplaced strong frequency components at high frequencies are likely to result in severe artefacts.

2.2.3 Plosives

A waveform and an amplitude spectrum of a typical plosive sound are presented in figure 2.5. Plosives are characterized by a short silent period caused by a break in voicing, followed by a short burst of frication as the pressure at the place of constriction is suddenly released in the vocal tract. After the burst comes a voicing period that leads to the following vowel. In addition to unvoiced sounds, plosives are also challenging for ABE methods. If the amplitudes of the added high-frequency components of a plosive are too large, the plosive is easily perceived as a tingle.

Figure 2.5. Typical plosive (Finnish [k]) presented as a time domain waveform (top), and its amplitude spectrum (bottom). The spectrum has been computed with a 1024-point FFT using a Hann window.

2.3 Sounds of speech

Even though speech as an acoustic signal is a continuous waveform, in the linguistic domain it comprises a finite number of discrete distinguishable sounds called phonemes.
A person who knows a certain language identifies these phonemes because they have a distinctive linguistic function, although the acoustic properties of signals representing a certain phoneme are both speaker and context dependent. A classification of speech sounds often starts from a division into vowels and consonants. The consonants and vowels are then further classified according to the manner and place of articulation. For example, the Finnish vowels are /a, e, i, o, u, y, ä, ö/. The Finnish consonants are classified as plosives /k, p, t, g, b, d/, fricatives /f, s, h/, nasals /n, m, ŋ/, trills /r/, laterals /l/, and semivowels /j, v/.

2.4 Source filter model

Speech production can be modelled with a source-filter model [16]. According to the model, the two parts, namely the source model and the filter model, are independent. Although this assumption is not completely justified, since there is some interaction between the glottal source and the filter, the model usually yields adequate results. The model consists of two alternative sources, a quasi-periodic pulse generator modelling the glottal pulse for voiced sounds and a noise signal modelling a constriction in the vocal tract for unvoiced sounds, as shown in figure 2.6. The fundamental frequency can be given as a parameter to the pulse generator. The gain control, G, is needed to control the energy level of the signal. The vocal tract and the nasal tract are modelled independently of the source signal by a linear time-varying filter.

Figure 2.6. Block diagram of a source filter model of speech production. The fundamental frequency, f0, can be given as a parameter for the voiced excitation. The gain control, G, is for controlling the energy level of the signal.

2.4.1 Linear prediction

The source-filter model has been applied to many areas of speech processing, for instance in speech analysis, speech synthesis, and speech coding. In speech coding, so-called vocoders utilize the source-filter model to reduce the number of parameters needed to characterize speech sounds. Parametrizing the source and the filter separately, instead of the whole waveform, has been an effective way to reduce the bit rate in speech coding. The vocal tract filter and the excitation signal can be estimated from a speech signal using a well-known technique called linear prediction (LP), or linear predictive coding (LPC) [17].

LP is based on the simple idea that the next signal sample, s(n), can be estimated as a linear combination of the p previous samples:

$$\hat{s}(n) = \sum_{k=1}^{p} a(k)\, s(n-k). \tag{2.1}$$

The parameters a(k) are unknown, and they are solved by minimizing the mean square of the energy of the error signal, e(n), between the real sample s(n) and the estimate ŝ(n). The error signal, also called the residual, can be written as

$$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a(k)\, s(n-k). \tag{2.2}$$

As a result of the autocorrelation method, the optimal LPC prediction coefficients $A = \left(\, a(1)\; a(2)\; \ldots\; a(p) \,\right)^{T}$ are obtained from

$$A = \mathbf{R}^{-1} \cdot \mathbf{r}, \tag{2.3}$$

where $\mathbf{R}$ is an autocorrelation matrix of the form

$$\mathbf{R} = \begin{pmatrix} R(0) & R(1) & \ldots & R(p-1) \\ R(1) & R(0) & \ldots & R(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ R(p-1) & R(p-2) & \ldots & R(0) \end{pmatrix} \tag{2.4}$$

and $\mathbf{r}$ is an autocorrelation vector

$$\mathbf{r} = \left(\, R(1)\; R(2)\; \ldots\; R(p) \,\right)^{T}. \tag{2.5}$$
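To make the autocorrelation solution of equations (2.3)–(2.5) concrete, the short Python/NumPy sketch below estimates the prediction coefficients for a single windowed frame and applies the inverse filter A(z) and the synthesis filter 1/A(z) introduced in the following subsection. The frame length, window, and LPC order are illustrative choices only, and the direct matrix solve stands in for the Levinson-Durbin recursion normally used in practice; this is a minimal sketch, not the implementation used in the thesis.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_coefficients(frame, p):
    """Prediction coefficients a(1)..a(p) for one windowed frame,
    solved from the normal equations A = R^-1 r of eq. (2.3)."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    r = ac[:p + 1]                                   # R(0) ... R(p)
    R = np.array([[r[abs(i - j)] for j in range(p)]  # Toeplitz matrix of eq. (2.4)
                  for i in range(p)])
    return np.linalg.solve(R, r[1:])                 # vector r of eq. (2.5)

def inverse_and_synthesis(frame, a):
    """Inverse filtering with A(z) (eq. 2.6) and resynthesis with 1/A(z) (eq. 2.7)."""
    A = np.concatenate(([1.0], -a))            # A(z) = 1 - sum_k a(k) z^-k
    residual = lfilter(A, [1.0], frame)        # e(n), the LPC residual
    resynthesis = lfilter([1.0], A, residual)  # reconstructs the frame
    return residual, resynthesis

# Example: a 20-ms Hann-windowed frame at 8 kHz (160 samples), LPC order 12.
frame = np.hanning(160) * np.random.randn(160)   # placeholder for a real speech frame
a = lpc_coefficients(frame, p=12)
e, s_hat = inverse_and_synthesis(frame, a)
```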
LPC analysis and synthesis

The LPC analysis parametrizes an input speech signal with a residual signal, p LPC coefficients, and a gain factor, G. First, the order of the LPC, i.e. the parameter p, the frame size, and the window function have to be defined. Typical window sizes are about 10–30 ms, a period of time over which features such as mean power, frequency spectrum, and probability density distribution may be considered to remain relatively constant [12]. As an example, a 20-ms frame of the Finnish speech sound [a], windowed with a rectangular and a Hann window, is shown in figure 2.7. For the windowed speech frame, the prediction coefficients, a(k), are computed from equation (2.3). The residual signal shown in figure 2.8 is obtained by filtering the original un-windowed frame with the resulting inverse filter of the form

$$A(z) = 1 - \sum_{k=1}^{p} a(k)\, z^{-k}. \tag{2.6}$$

Finally, the gain factor, G, is calculated from the original frame.

Figure 2.7. 20-ms frame of vowel [a], windowed with a rectangular (thin line) and a Hann window (bold line).

Figure 2.8. LPC residual signal, e(n), for vowel [a], calculated from the signal in figure 2.7.

On the synthesis side, the original speech signal is reconstructed by filtering the residual signal with an LPC synthesis filter of the form

$$H(z) = \frac{1}{1 - \sum_{k=1}^{p} a(k)\, z^{-k}}. \tag{2.7}$$

The LPC synthesis filter models the spectral envelope of the signal. In practice, the order of the LPC, p, defines how accurately the filter models the overall spectral shape, the formants, and the fine structure of the signal. Therefore, the parameter p is also directly proportional to the prediction gain that is often used to measure the performance of LPC: a higher LPC order yields a more accurate LPC model. Figure 2.9 shows the amplitude spectrum of vowel [a] and the corresponding LPC spectrum of order 12.

Figure 2.9. Amplitude spectrum of vowel [a] (thin line) and an LPC spectrum of order 12 (bold line).

2.5 Hearing

The human ear, shown schematically in figure 2.10, can be divided into the outer ear, the middle ear, and the inner ear. The outer ear consists of the pinna and the ear canal, which terminates at the eardrum. The pinna protects the ear canal but also facilitates the localization of sound sources [14]. The ear canal acts as an acoustic tube with its first resonance at about 4 kHz. The eardrum transforms the pressure variations of incoming sound waves into mechanical vibrations [18].

In the middle ear, the mechanical vibrations are transmitted to the inner ear by three bones, the ossicles (the malleus, the incus, and the stapes). The ossicles perform an impedance transformation between the air medium of the outer ear and the liquid of the inner ear. At the oval window, where the middle ear ends, the pressure is about 30 times the pressure at the eardrum. Between the middle ear and the oral cavity is the Eustachian tube, which equalizes the pressure between the middle ear and the outer ear during, for example, swallowing or air pressure changes due to a rapid change in altitude.

In the inner ear, the spiral cochlea begins at the oval window, and it transforms the vibrations into properly coded neural impulses. Vibrations arriving at the cochlea make the basilar membrane vibrate.
Next to the basilar membrane is the organ of Corti, which contains approximately 20000–30000 hair cells arranged in several rows. The movement of the basilar membrane activates the hair cells that, in turn, excite neurons in the auditory nerve.

Figure 2.10. Schematic illustration of the human ear.

2.5.1 Binaural hearing and localization

Communication systems have traditionally used monaural hearing, where sound is perceived by only one ear or by two ears with no difference between the signals heard by the two ears. However, when sounds are received by two ears, the human hearing system also analyses the spatial and localization information in the signals. This binaural hearing can be beneficial in many situations, such as noisy, complex auditory environments. An example of the benefits of binaural hearing is the well-known "cocktail party effect". It refers to the fact that human listeners can concentrate on listening to one speaker when others are talking simultaneously or when background noise is present [19]. From the speech transmission point of view, spatial audio could be beneficial, for example, in teleconferencing systems. The participants of a teleconference might be virtually placed at different positions around the listener. Since the performance of 3D audio depends on the bandwidth of the signal, ABE methods for binaural signals could be applied when wideband speech transmission is not available.

Sound source localization

The position of a sound source is often described with the coordinate system shown in figure 2.11 [19]. When the listener is placed at the origin of the coordinate system, the position of a sound source can be defined by three attributes: the azimuth, φ, the elevation, δ, and the distance, d. The median plane cuts the head of the listener into two symmetrical halves, the horizontal plane is defined by the interaural axis and the lower margins of the eye sockets, and the frontal plane is orthogonal to the horizontal plane, intersecting the interaural axis.

Figure 2.11. Coordinate system used to describe the position of a sound source with the listener placed at the origin. The position of a sound source can be defined by the azimuth, φ, the elevation, δ, and the distance, d.

The existence of two ears is the main reason behind the ability of human listeners to identify the direction of a sound source. The auditory system analyses many temporal and spectral cues from the signals received by the ears. If a sound source is not located in the median plane, the sound signal arrives earlier at the nearer ear than at the farther ear. In other words, there is a time difference between the signals, which is usually referred to as the interaural time difference (ITD). In addition, the sound shadow of the head attenuates the signal on its way to the farther ear. This intensity difference between the two ears is called the interaural level difference (ILD). The maximum ITD, of about 700 μs, and the maximum ILD, of about 6 dB, are achieved when the sound source is located in the horizontal plane in the direction of φ = ±90°.

ITD and ILD are considered the main cues for the localization of sound sources, but the asymmetry of the outer ear, head, and torso also contributes to directional hearing. For instance, front-back separation is possible mainly because the pinna changes the spectral characteristics of the sound differently depending on whether the sound source is in front of or behind the head. Furthermore, the detection of the elevation angle is possible due to the asymmetry of the outer ear and the reflections from the shoulders.

The precision of the localization of a sound source, i.e. the localization blur, depends not only on the azimuth and elevation angles but also on the frequency and bandwidth of the sound signal. To generalize, the localization precision is better at low frequencies, whereas around 1500 Hz the precision is much worse because the signal wavelength is comparable to the size of the head, and the ITD is no longer a valid cue. In addition, the localization blur is smallest when the sound source is in front of the listener, at position φ = 0°. When the sound source is located at positions φ = ±90°, the localization blur is at its maximum [20].

A head-related transfer function (HRTF) can be used to describe how an ear receives a sound. The HRTF is a transfer function from a point sound source to the ear measured in a free field. Typically, HRTFs are measured in anechoic chambers using a dummy head.

3D sound

3D sound refers to an attempt to reproduce sound through loudspeakers or stereo headphones so as to create for the listener an illusion of a natural environment and sound sources. With headphones, the perception of a sound depends directly on the signals that are brought to the ears. Without any differences, or with only time and level differences, the sound is localized inside the head of the listener. This effect, called lateralization [21], is due to the fact that all the cues induced by the outer ears and the head are missing. Therefore, to create 3D sound with headphones, these cues should be included in the signals. This can be achieved by processing the signals with the corresponding HRTFs, as shown in figure 2.12. The left channel, yl, is obtained by filtering a mono input signal, x, through the left HRTF filter, Hl, and the right channel, yr, is obtained by filtering the input signal, x, through the right HRTF filter, Hr, respectively [20].

Figure 2.12. Binaural sound production through headphones. The left and the right channels are obtained by filtering the mono signal, x, with the HRTF filters Hl and Hr.

Achieving the same spatial impression in loudspeaker listening requires further processing: the HRTFs have to be modified to compensate for the crosstalk from the other loudspeaker to the farther ear.
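The headphone rendering of figure 2.12 amounts to convolving the mono signal with a pair of head-related impulse responses, the time-domain counterparts of Hl and Hr. The sketch below illustrates this under stated assumptions: the impulse responses are crude placeholders encoding only an interaural delay and level difference rather than measured HRTFs, and the code is an illustration of the general principle, not the binaural processing studied in the thesis.

```python
import numpy as np

def render_binaural(x, hrir_left, hrir_right):
    """Produce left and right headphone channels by filtering the mono
    signal x with head-related impulse responses (cf. figure 2.12)."""
    y_left = np.convolve(x, hrir_left)
    y_right = np.convolve(x, hrir_right)
    n = max(len(y_left), len(y_right))
    pad = lambda y: np.pad(y, (0, n - len(y)))
    return pad(y_left), pad(y_right)

# Placeholder impulse responses standing in for measured HRTFs: a source on
# the listener's left, modelled only as ~0.3 ms ITD and ~3 dB ILD at 16 kHz.
fs = 16000
hrir_left = np.zeros(64);  hrir_left[0] = 1.0
hrir_right = np.zeros(64); hrir_right[5] = 0.7   # arrives ~0.3 ms later, attenuated

x = np.random.randn(fs)                          # placeholder one-second mono signal
y_l, y_r = render_binaural(x, hrir_left, hrir_right)
```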
3. Digital speech transmission

The telephone was invented in 1876 by Alexander Graham Bell. According to Oliver Lodge, Graham Bell's telephone "took the sounds emanating from the human voice, and sought to reproduce them at a distance by electrical means; so that at any, even great, distances, aerial vibrations could be reproduced and interpreted by the human ear, after the same fashion as the original aerial vibrations could be interpreted by an ear close to" [22]. Since the early days of the telephone, there have been limitations that have prevented this aim from being completely achieved. One of the limitations has been the bandwidth of the transmitted speech signal, resulting in much worse speech quality than in face-to-face conversation.
The human voice contains frequencies from 20 Hz to 20000 Hz, whereas the traditional telephone bandwidth, also called narrowband, covers only the range from 300 Hz to 3400 Hz. The limited bandwidth reduces speech quality [23] and intelligibility [24]. Furthermore, speaker recognition becomes more difficult due to the limited frequency band. The telephone bandwidth in use originates from analogue telephony, where the bandwidth was limited by the characteristics of the transducers and other hardware, and by the analogue frequency-division multiplexing with a frequency grid of 4 kHz [25]. Digital speech transmission was built around the same principles as the analogue system for compatibility reasons, and for decades, consumers were offered only narrowband speech transmission. Recently, wideband speech transmission (50–7000 Hz) has also become available in VoIP applications and gradually also in cellular networks. So far, HD voice has been launched on 41 mobile networks in 33 countries [26]. The transition from narrowband communications to wideband in cellular networks continues, but it will take years, because in addition to the networks, the terminals also have to be upgraded to wideband-compatible ones. In addition to wideband speech transmission, superwideband (50–14000 Hz) and fullband (20–20000 Hz) speech transmission are also being developed [27]. An average speech spectrum calculated from a speech sample of a male speaker and the different telephone bandwidths are illustrated in figure 3.1.

Figure 3.1. Average speech spectrum calculated from a 10-s long speech sample of a male speaker. The narrowband (NB) bandwidth (300–3400 Hz) is coloured red, the wideband (WB) bandwidth (50–7000 Hz) is coloured blue, and the superwideband (SWB) bandwidth (50–14000 Hz) yellow.

3.1 Speech coding

At present, speech is mostly transmitted in digital form in telecommunication networks, such as the public switched telephone network (PSTN), digital cellular networks, and VoIP networks. For digital transmission, the analogue speech signal has to be represented in a digital form, and for this reason speech coding is needed. In other words, speech coding aims to represent the speech signal in a digital form with as few bits as possible while maintaining adequate speech intelligibility and quality for the application in mind. In addition to the bit rate and quality, speech coders can be characterized by their complexity and the delay they introduce [27].

The desired bit rate of a speech codec is determined by the channel capacity of the application. There is a trade-off between the bit rate and the voice quality and intelligibility [28]. In some applications, having a coder with multiple bit rates is also desirable. The coder may change the bit rate on the fly depending on the available channel capacity [27].

The quality of a speech coder is usually expressed as a mean opinion score (MOS) value. It is a five-point scale from 1 to 5 (bad, poor, fair, good, excellent) that is obtained from subjects in a subjective listening test or estimated using an objective measure. It should be noted that MOS values obtained from separate listening tests or objective evaluations should not be directly compared with each other. In particular, care should be taken when comparing the quality of speech signals of different bandwidths.
The complexity of a speech coder is usually represented as the computational requirement (millions of instructions per second, MIPS) and memory consumption. The target is to minimize the complexity, as it directly affects the cost and energy usage of an application [27]. According to [29], in a telephone conversation, the end-to-end delay from the far-end user’s mouth to the near-end user’s ear is desired to be less than 150 ms. This is regarded as the limit for transparent interactivity, and greater delays hinder the conversation. However, a study reported in [30] suggests that a much bigger delay could be tolerated in a two-way conversation. The speech coder is one of the processing steps in the whole end-to-end processing chain, and it contributes directly to the delay. 3.2 Pulse code modulation Pulse code modulation (PCM) coding became a coding standard for PSTN networks in the 1970’s. PCM is a waveform coding method that aims to represent the time-domain speech waveform as accurately as possible. It uses a sampling rate of 8 kHz to sample analogue speech signals. The PCM speech bandwidth (300–3400 Hz) was adapted from analogue telephone systems, and it is specified in [1]. A non-linear quantization called A law (Europe, Africa, Australia, South America) or μ law (North America and Japan) is used. Both quantization laws are logarithmic such that more quantization levels are reserved for low amplitude values. The quantized amplitude values are then encoded with 8 bits, which results in the logarithmic PCM (log-PCM) coding with the 64-kbit/s bit rate, as specified in the International Telecommunication Union – Telecommunication Standardization Sector (ITU-T) standard G.711 [31]. The log-PCM achieves a MOS value of 4.3 that is called the toll quality. Another variant of PCM is the adaptive differential PCM (ADPCM) at 16/24/32/40 kbit/s [32]. 35 Digital speech transmission 3.3 Narrowband speech in cellular networks The first generation cellular networks, such as the Nordic Mobile Telephone (NMT) in Scandinavia and the Advanced Mobile Phone System (AMPS) in America, were analogue systems. Various standards existed around the world, and different mobile phones were needed for each standard. The speech quality in analogue systems is dependent on the signal quality. For example, the distance between the mobile phone and the base station impacts the quality directly so that when this distance increases the quality drops. In the mid 1980’s, the work on the European standard for a digital mobile cellular networks was started by the European Conference of Postal and Telecommunications Administrations (CEPT), and the work was later moved to the European Telecommunications Standards Institute (ETSI). The task was known as the Groupe Spécial Mobile (GSM), but the work was later renamed as Global System for Mobile Communications [33]. Since the commercial launch of GSM in the 1990’s, the number of GSM and other mobile subscribers began to increase rapidly, reaching 5.6 billion mobile connections in 2011 [34]. Starting from the 1980’s, the research focus in speech coding was on developing narrowband low-rate coders for cellular and military communication [35]. The starting point for the second generation coders was the narrowband PCM coded signal. The work on speech coding aimed at developing speech coding algorithms with a much lower bit rate than 64 kbit/s yet having adequate speech quality and tolerable complexity for mobile communications. 
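As an illustration of the logarithmic quantization described in Section 3.2 above, the continuous μ-law companding curve and a simple 8-bit quantization can be sketched as follows. This is only a sketch of the principle, not the bit-exact G.711 encoder, and the uniform 8-bit mapping used here is a simplifying assumption:

import numpy as np

MU = 255.0  # mu-law constant used in North America and Japan

def mulaw_compress(x):
    # Companding: more quantization levels are allocated to small amplitudes.
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mulaw_expand(y):
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

def encode_8bit(x):
    # Uniform 8-bit quantization of the companded value; at 8000 samples/s
    # this yields the 64-kbit/s rate of log-PCM.
    y = mulaw_compress(np.clip(x, -1.0, 1.0))
    return np.round((y + 1.0) * 127.5).astype(np.uint8)

def decode_8bit(code):
    y = code.astype(np.float64) / 127.5 - 1.0
    return mulaw_expand(y)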
The first codec for the GSM, RPE-LTP (regular pulse excitation with long term prediction), is based on LPC analysis and operates at a bit rate of 13 kbit/s. The MOS value of the RPE-LTP codec is approximately 3.5, which is significantly lower than that of PCM. In addition to the RPE-LTP codec, many other speech codecs were standardized for GSM, the Universal Mobile Telecommunications System (UMTS), and other 2G and 3G cellular networks. For example, the GSM Enhanced Full Rate (EFR) codec, operating at 12.2 kbit/s, is based on an LPC-based method called algebraic code excited linear prediction (ACELP). Another example is a half-rate coder called the vector sum excited linear prediction (VSELP) coder with a bit rate of 5.6 kbit/s. Some of the well-known narrowband speech codecs are listed in table 3.1.

The adaptive multi-rate (AMR) coder standardized by ETSI is an example of speech coders having multiple bit rates [5]. It is based on ACELP technology and operates at eight different bit rates from 4.75 kbit/s to 12.2 kbit/s. The highest bit rate corresponds to the GSM EFR speech coder. The bit rate of the AMR coder adapts to the radio channel conditions so that when more channel capacity is available, more bits can be used for speech coding, and consequently better speech quality is achieved.

Table 3.1. List of the most relevant narrowband (300–3400 Hz) speech codecs, their bit rates, and indicative MOS values [36].

Algorithm   Standard   Bit rate (kbit/s)   Quality (MOS)
log-PCM     G.711      64                  4.3
ADPCM       G.726      16/24/32/40         toll
RPE-LTP     GSM FR     13                  3.7
VSELP       IS-54      5.6                 3.6
LD-CELP     G.728      16                  4
CS-ACELP    G.729      8                   4
ACELP       GSM EFR    12.2                4

3.4 Wideband and beyond

The first speech codec standardized for wideband speech for mobile communications was the Adaptive Multirate Wideband (AMR-WB) codec [37]. The AMR-WB was released in 2001 by the 3rd Generation Partnership Project (3GPP), and the same speech coder was also selected as an ITU-T recommendation, G.722.2. The AMR-WB codec is based on ACELP technology and operates at nine bit rates from 6.6 kbit/s to 23.85 kbit/s. The main difference from the GSM AMR codec is that it operates at a sampling rate of 16 kHz, which is required for the nominal range of 50–7000 Hz. In mobile communications, the speech quality of the PSTN was achieved with the GSM EFR and AMR codecs. However, it was not until the AMR-WB codec that the narrowband PCM quality was finally exceeded. In [37], the perceived speech quality of AMR-WB at bit rates of 8.85 kbit/s and above was reported to be superior to the quality of narrowband AMR at 12.2 kbit/s. For wideband telephony in GSM or UMTS networks, either tandem-free operation (TFO) or transcoder-free operation (TrFO) is required. In TFO, the compressed wideband speech parameters are transmitted within the PCM 64-kbit/s bit stream. Wideband speech quality is achieved, but a 64-kbit/s bit rate is required. Another option is to use TrFO, where the whole end-to-end link supports the same codec type, and transcoding is not needed. With TrFO, wideband speech is obtained with lower transmission rates. The integration of the AMR-WB in 2G and 3G mobile networks is ongoing. The first operator to launch the AMR-WB for consumers in a 3G network was Orange in Moldova in September 2009 [7]. For consumers, the better speech quality achieved by AMR-WB coding is marketed as HD voice, and the number of operators worldwide offering HD voice in 3G networks is gradually increasing.
However, the transition to wideband is time-consuming, since all terminals and networks have to be upgraded to wideband before full coverage is achieved. During the transition phase, the speech bandwidth of each phone call is dictated by the weakest link of the connection, i.e. the two terminals and the network between them. Furthermore, the speech quality may vary within a phone call due to both narrowband-to-wideband and wideband-to-narrowband handovers. In these situations, ABE methods can be utilized to narrow the quality gap between narrowband and wideband speech. The next generation of mobile networks (4G), called Long Term Evolution (LTE), differs from the PSTN and 2G/3G networks by being a packet switched network instead of a circuit switched one. LTE was designed especially to offer higher data rates with lower latency for the increasing number of mobile data services. From the traffic point of view, even though mobile data volumes exceed voice volumes, conversational telephony will remain an important application in future mobile networks as well. Various solutions for voice over LTE (VoLTE) have been designed. For example, in GSM/WCDMA, the expected deployment strategy of VoLTE is three phased [38]. In the first phase, voice is transmitted in the legacy 2G/3G network, while only data is carried on LTE. This solution is called the circuit-switched fallback. The next phase uses IP multimedia system (IMS) based on VoIP solutions that enables handover from IMSbased VoIP to circuit switched speech when the user equipment is running out of VoIP coverage, and vice versa. Finally, in the third phase of the deployment strategy, all calls are made over packet-switched networks. When the LTE was introduced by the 3GPP, the speech codecs were inherited from the UMTS [39]. The suitability of the default codecs, AMR and AMR-WB, have been tested for packet-switched conversational multimedia applications in [40]. However, VoLTE offers new possibilities for even wider speech bandwidth, and consequently for better speech quality. 38 4. Speech quality and intelligibility In a telephone conversation, the acoustic signal is perceived by the nearend user’s ear, causing an auditory event in the brain, which results in a description of the sound [41]. Both the prior knowledge of the communication system and the emotions of the listener affect the quality judgement [42]. In addition, both the content and the form of the signal are analysed by the listener, and in telecommunications, speech quality usually refers to the form of the speech signal, i.e. the acoustic signal, although both content and individual factors affect the quality perception [43]. Speech quality is complex to define. According to [44], speech quality encompasses attributes such as naturalness, clarity, pleasantness, and brightness. Quality relates to "how" a speaker produces an utterance. Intelligibility, on the other hand, is "what" is been said, and it is not a dimension of speech quality. A similar definition of speech quality is also used in this thesis. There are, however, other ways to define speech quality, where intelligibility is considered as a dimension of speech quality [45]. In speech transmission, the minimum requirement is that speech is intelligible enough, that what is said is understood. As the intelligibility increases, other factors like naturalness and recognizability of the voice become more important. During the last decades, many new speech coding and transmission systems have been introduced. 
Besides the traditional narrowband (300– 3400 Hz) speech coding and transmission, also wideband (50–7000 Hz) and superwideband (50–14000 Hz) speech transmission have been deployed in many telephone applications. In addition, together with VoIP applications, the number of different degradation types in speech signals, such as packet loss, has increased. As a result, the need for new effective and reliable methods for the evaluation of speech transmission systems, coding standards, and speech enhancement methods has increased. Speech quality can be assessed by subjective or objective methods. In 39 Speech quality and intelligibility subjective (auditory, perceptual) methods, the evaluation is made by asking people’s opinions on speech sounds and applying statistical analysis to the data. Objective (instrumental) methods, on the other hand, are computational measures that try to predict the subjective speech quality. 4.1 Subjective quality evaluation Subjective sound quality assessment can be categorized into four groups as shown in table 4.1 [46, 47]. The vertical categorization is made between utilitarian and analytic methods. In utilitarian methods, the subjects evaluate the integral quality using a one-dimensional scale. These tests can be used to compare varying conditions resulting from different speech coding algorithms. In analytical test methods, the subjects evaluate certain features of the perceived sound. The subjects may be asked to assess one certain feature using a one-dimensional scale or a number of quality features with several scales. The horizontal classification of the quality test methods in table 4.1 is made between subject-oriented methods and object-oriented methods. The former focus on gathering information on human perception, whereas the latter are used to evaluate the quality of a certain system. Table 4.1. Subjective listening quality tests following [46]. Subject-oriented Object-oriented Utilitarian Psychoacoustic research Sound quality assessment Analytical Audiological evaluation Diagnostic listening tests The evaluation methods of speech transmission systems are mainly objectoriented. The overall speech quality is often analysed with utilitarian methods. For example, in audio standards, such as ITU-T P.800, P.830 and P.805, most of the methods are utilitarian and univariate tests. However, also analytical methods may be used, and they could result in more in-depth understanding of speech quality in digital transmission systems. An example of such tests is the ITU-T Recommendation P.835, that is a standard for evaluating speech transmission systems that include a noise suppression algorithm [48]. In the test, the subjects are asked to assess the speech signal, the background noise, and the overall effect separately. Subjective evaluation of speech quality is needed when reliable objective measures do not exist. Also, subjective evaluation may be employed when evaluating the overall quality or complex parameters of speech quality. 40 Speech quality and intelligibility However, subjective evaluation is not very effective, because the collection of each data point requires that a subject grades the performance of a sample. The resulting data is also prone to variance, since subject’s personal opinion is involved. To improve the reliability of subjective evaluation, rigorous testing procedures are required. Formal subjective tests following formal methods and standards are more time-consuming to arrange than informal tests. 
However, they provide more reliable and repeatable results. In addition, formal statistical analysis gives information on the quality of data. The selection of the listeners for a subjective test depends on the test type in use. Naive (untrained) test subjects are usually involved in utilitarian tests, where the objective is to find out the opinion of the average telephone user. According to [49], a naive subject has not been involved in work connected with assessment of telephone circuits, has neither participated in any subjective test for at least the previous six months nor in any listening opinion test for at least one year, and has not heard the same sentence lists before. The subjects should not have any kind of hearing impairment, and their mother tongue should be the same as the language in the test [47]. An alternative for naive test subjects is to use expert (trained) listeners. For instance, in analytical methods, the results of the test are more reliable if the subjects have been trained for the task. 4.1.1 Listening-only tests The speech quality of a communication system is most often assessed by listening-only tests. Usually this refers to using a pre-recorded and processed set of speech samples that is presented to the subjects, for example, through headphones, and the subjects are asked to assess the quality using a predefined scale given by the experimenter. An advantage of these methods is that the evaluation is subjective, i.e. subjects listen to real speech samples and grade them according to their personal opinion. A drawback is that listening tests are rather directed, meaning that test design factors and rating procedures affect the subjects perception process. For example, the samples are pre-recorded and usually quite short (one sentence, for example), and thus a single artefact in a sample may dictate the entire grading of the sample. Furthermore, the subject focuses mainly on the form of the acoustic signal and not the content because he or she is placed only in a listening context, which is not natural in a normal telephone conversation situation. 41 Speech quality and intelligibility ITU-T recommendation P.800 describes several test methods and category rating scales for listening opinion tests that are often used to evaluate speech transmission systems [49]. The recommendation also describes reference conditions that are important if two tests arranged in different laboratories or at different times are to be compared with each other. Absolute category rating (ACR) is perhaps the most widely used test to assess the overall speech quality. In the test, the subjects are asked to assess the speech quality using the 5-point mean opinion scale (MOS) shown in table 4.2. For instance, the performance of speech codecs is usually evaluated with ACR tests. Listening effort and loudness preference may also be evaluated using a 5-point scale from 5 to 1 [49]. Table 4.2. Mean opinion score according to ITU-T P.800 [49]. Score Quality of the speech 5 excellent 4 good 3 fair 2 poor 1 bad The degradation category rating (DCR) test is designed to compare small degradations in speech quality compared to a reference system. The subject is asked to evaluate a degraded signal compared to a reference using the degradation mean opinion score (DMOS) shown in table 4.3. The reference signal is always presented first to the subjects followed by the test signal. Table 4.3. Degradation mean opinion score according to ITU-T P.800 [49]. 
Score Degradation is 5 inaudible 4 audible but not annoying 3 slightly annoying 2 annoying 1 very annoying The comparison category rating (CCR) test is another test where two signals are compared with each other. In the CCR test, the latter signal is rated compared to the former with the comparison mean opinion score (CMOS) shown in table 4.4. This test can also be applicable to assess small differences between two or more systems. The results of a CCR test provide information on which sample is better and by how much. 42 Speech quality and intelligibility Table 4.4. Comparison mean opinion score according to ITU-T P.800 [49]. Score Quality of the second compared to that of the first 3 much better 2 better 1 slightly better 0 about the same -1 slightly worse -2 worse -3 much worse Binary paired comparison can be employed in speech quality assessment as well. The test signals are presented to the listeners in pairs, and the subjects are asked to choose the one they prefer. The paired comparison test is suitable when the differences between the conditions are small. 4.1.2 Conversational tests Compared to listening-only tests, conversational tests are a step closer to a real conversation situation. Everyday telephone communication is mostly conversational, and the quality perception is undirected and individual. The listener pays attention to features that the individual person considers relevant for the communication situation. For example, the telephone user may consider important features like intelligibility, loudness, listening effort, or naturalness. The significance of semantic information increases as the user is placed in a conversational context where listening, talking, double-talk, and periods of mutual silence alternate. Conversational tests are, however, more expensive and time-consuming to arrange and therefore quite rare. A newish ITU-T recommendation P.805 describes a subjective test to assess conversational quality over a telephone transmission [50]. The test can be used to evaluate either a specific source of degradation, such as delay or echo, or the overall quality of a transmission system. The target is to have as realistic a communication environment as possible, where two people are having a true spontaneous conversation over a telephone system. In practice, two subjects at a time are placed in separate sound proof rooms. The subjects can be experts, experienced, or naive participants, depending on the purpose of the test. During the test, the subjects have a conversation on a given topic, or task, and then give their opinion on the voice quality. There can be simulated noise environments either at 43 Speech quality and intelligibility one end of the conversation or at both ends. 4.1.3 Field tests The speech quality experienced by the end user is affected by the entire speech link, from the far-end to the near-end user. Field tests offer the most realistic environment for assessing speech quality of transmission systems. For example, the speech quality of a mobile phone could be assessed during true phone calls. In practice, field tests are rare and expensive to arrange, especially during the development process of a system, and therefore quality tests are usually arranged in laboratory conditions. Laboratory conditions are also easier to design such that the test is repeatable. 
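The ratings collected with the category scales described above (ACR, DCR, CCR) are summarized as mean scores, usually reported together with confidence intervals. A minimal sketch of such a computation is given below; the ratings are hypothetical example data, and the t-distribution interval is one common choice rather than a requirement of the standards:

import numpy as np
from scipy import stats

def mos_with_ci(ratings, confidence=0.95):
    # Mean opinion score and the half-width of its confidence interval.
    ratings = np.asarray(ratings, dtype=float)
    n = ratings.size
    mos = ratings.mean()
    sem = ratings.std(ddof=1) / np.sqrt(n)   # standard error of the mean
    half_width = stats.t.ppf(0.5 + confidence / 2.0, df=n - 1) * sem
    return mos, half_width

ratings = [4, 3, 4, 5, 3, 4, 4, 2, 3, 4, 5, 4]   # one condition, 12 listeners
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f} (95% CI)")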
4.2 Objective quality evaluation Objective (instrumental) quality assessment methods are computational tools that have been developed to evaluate the quality of speech signals and communication systems. Some of the measures analyse only certain aspect of the signals, whereas others try to model human perception more precisely. They are fast, repeatable, and automatic tools compared to time-consuming subjective testing, but still they are only models and can not replace subjective testing completely. Objective quality assessment methods are often classified as parameterbased and signal-based models [43, 47]. Parameter-based models, such as the E model [51], rate the integral quality of an entire transmission path. They provide information on the whole network and are useful, for example, in network planning. Signal-based methods are more useful in the field of speech enhancement. In such methods, the quality score of a system is computed directly from the degraded signal and the original reference signal, or alternatively only from the degraded signal. Some of the methods measure only one feature of the sound, whereas others model the overall quality through a perceptual model. The simplest objective measures include time-domain and spectraldomain measures, such as SNR ratio, mean square error, or spectral distance measures. These measures can be useful in many cases, but are not very good predictors of subjective speech quality. Therefore, perceptual measures have also been designed. 44 Speech quality and intelligibility The perceptual evaluation of speech quality (PESQ), standardized in ITU-T Recommendation P.862 [52], can be used to evaluate narrowband speech codecs or end-to-end quality. Both the degraded and a reference signal are necessary. The model calculates several delays between the two signals and compares them using a perceptual model. The result is an objective quality score. This score can be mapped to a listening quality MOS that allows linear comparison with the MOS. An extension of the PESQ for wideband signals is presented in the recommendation P.862.2 [53]. An upgrade for the PESQ is called the perceptual objective listening quality analysis (POLQA) and is presented in the recommendation ITU-T P.863 [54]. The POLQA supports narrowband, wideband and superwideband speech signals, and it is intended to cover most of the telephone network scenarios. 4.3 Intelligibility tests The first subjective tests focused on measuring intelligibility through subjective tests. In articulation or word tests, intelligibility is measured as the percentage of correctly recognized speech sounds at the receiving end of a system. Either short segments of speech, such as monosyllables, or complete words can be used. In rhyme tests, introduced by Fairbanks [55], rhyming words are used instead. The diagnostic rhyme test (DRT) consists of 96 rhyming word pairs that differ in their initial consonant [56]. The subject hears one word at a time and identifies the word he or she thinks he or she heard from the pair of words listed. An error rate is calculated from the results. The modified rhyme test (MRT) contains 50 word lists of six one-syllable words, differing in either the initial or the final consonant [57]. The subject hears one word at a time and chooses from the list the word he or she thinks he or she heard. The result of the test is an error rate of correctly identified words. 
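The outcome of a rhyme test such as the MRT described above is essentially the proportion of correctly identified words; a correction for guessing is sometimes applied because each trial is a six-alternative forced choice. The sketch below uses hypothetical word lists, and the guessing correction shown is one common variant rather than a prescribed part of the test:

def mrt_scores(presented, responses, n_alternatives=6):
    # Raw proportion correct, error rate, and a chance-corrected score.
    correct = sum(p == r for p, r in zip(presented, responses))
    n = len(presented)
    raw = correct / n
    error_rate = 1.0 - raw
    chance = 1.0 / n_alternatives
    corrected = (raw - chance) / (1.0 - chance)
    return raw, error_rate, corrected

presented = ["bat", "pad", "sat", "mad", "pat", "bad"]
responses = ["bat", "pat", "sat", "mad", "pat", "bad"]
print(mrt_scores(presented, responses))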
Speech intelligibility tests with speech segments longer than single words take better into account impairment of continuous speech. Speech reception threshold (SRT) is a test to find a presentation level for test speech necessary for a listener to understand the speech correctly a specified percentage of the time, usually 50%. In the SRT, complete test sentences are presented either in silence or in the presence of a reference noise signal. 45 Speech quality and intelligibility 46 5. Artificial bandwidth extension of speech 5.1 Background ABE methods for speech attempt to regenerate the frequency content that is lost due to narrowband speech coding. Usually narrowband speech bandwidth refers to the frequency range from 300 Hz to 3400 Hz that is used in many existing speech coding standards, e.g. in PCM [31], the AMR codec [5], or the G.729 codec [58]. The ABE method typically doubles the 8-kHz sampling rate of narrowband speech to 16 kHz and adds new frequency content to the signal. Bandwidth extension towards high frequencies creates new content in the frequency band from 3.4 kHz (or 4 kHz) to 7 kHz (or 8 kHz). There are also bandwidth extension methods towards low frequencies, 50–300 Hz [59, 60, 61]. Although the naturalness of voice is degraded due to the missing low frequency content, these methods are not studied in this thesis. High frequencies are more important from the point of view of speech intelligibility. Furthermore, the reproduction of the low frequencies with small earpieces of mobile devices is not always possible. Bandwidth extension methods can be further classified as artificial methods and methods with side information. The word "artificial" refers to algorithms that attempt to regenerate the lost frequency content utilizing only the information available in the narrowband signal, i.e., no information about the missing frequencies is transmitted. Since the extension is solely based on the narrowband signal, these methods can be implemented at the receiving end of the transmission channel. There are also methods that are not artificial but utilize transmitted side information in the extension procedure [9, 10, 11]. These methods hide some side information related to the missing frequency band in the narrowband sig- 47 Artificial bandwidth extension of speech nal. The methods are independent and can be used with any narrowband speech codecs. A drawback of the bandwidth extension methods with side information is that their exploitation requires that the same method is supported at both ends of the communication link. On the other hand, the performance of such methods can obviously be superior to that of the artificial bandwidth extension methods, as was stated in [9, 11]. Bandwidth extension with side information can be also implemented as a part of speech coding, as in the AMR-WB codec [6] or in the G.729.1 codec [62]. The motivation for utilizing bandwidth extension techniques in speech codecs is to obtain better speech quality at very low bit rates, rather than to overcome limitations due to narrowband speech coding or to enhance transmitted narrowband speech. The focus of this section, and of the whole thesis, is on ABE methods that aim to regenerate the signal at high frequencies. The missing frequency band from 4 to 8 kHz, i.e. the extension band, is therefore referred to as the highband. 
5.1.1 Correlation between narrowband signal and the missing highband Typically, ABE methods are data-driven algorithms that utilize true wideband references in the training of the extension procedure. They are built on the assumption that the narrowband signal and the missing highband are correlated and the narrowband signal contains enough cues to regenerate the missing highband. Especially for voiced speech, the correlation originates from the low-pass characteristics of the excitation signal. The dependency between the narrowband signal and the missing highband has been addressed in [63, 64, 65, 66]. The upper bound on the achievable quality of a memoryless bandwidth extension system was discussed in [64]. In their study, the mutual information between the features of the narrowband speech signal and the representation of the missing frequency band was evaluated with respect to an objective distance measure, a mean log spectral distortion (LSD), between the ABE output and the true wideband signal. The results indicate that to minimize the LSD, the narrowband features should be selected so that the mutual information (MI) is maximized. The ratio between the MI of the narrowband and the highband, and the entropy of the highband was used in [63] to measure the uncertainty of the highband envelope given the narrowband. The dependency was found to 48 Artificial bandwidth extension of speech be relatively small, and the authors conclude that a bandwidth extension method with a memoryless mapping may perform well, not because of an accurate prediction of the highband, but because the signal bandwidth is extended such that the signal sounds pleasant. As an interpretation for the low MI between the narrowband and highband, it was explained in [65] that instead of one-to-one mapping, the narrowband and highband spectral envelopes have a one-to-many relationship. The dependency analysis between the narrowband and the highband was further extended in [66, 67] by investigating the role of speech memory in increasing the certainty of the highband. The results showed that the certainty is increased as measured by the ratio of MI to the highband entropy, and the bandwidth extension methods can benefit from a shortterm memory. 5.2 General model for artificial bandwidth extension A general model for ABE that most of the ABE algorithms follow is shown in figure 5.1. The input signal, snb , is a narrowband signal with a sampling rate of 8 kHz. Through interpolation and lowpass filtering the signal is up-sampled to 16 kHz. The resulting signal, slo , is a wideband signal with narrowband content. A feature set is extracted from the input signal and used to estimate parameters for highband shaping. The excitation for the highband is first created to regenerate the highband signal. The highband signal, shi , is obtained after shaping the excitation utilizing the estimated shaping parameters. Finally, the ABE output, sabe , is obtained by adding the generated highband signal and the up-sampled and lowpass filtered original narrowband signal. The general model for ABE methods presented in figure 5.1 can be further modified to describe methods that are even more specifically based on the LPC-based speech production model. A majority of the ABE methods in the literature more or less follows the algorithm structure shown in figure 5.2. The highband excitation signal is obtained by extending the narrowband LPC residual signal to the highband. 
Additionally, the LPC coefficients for the highband envelope are also extended, and they are used in the LPC synthesis to produce the highband signal.

Figure 5.1. General model for ABE. The input signal, snb, is a narrowband signal with a sampling rate of 8 kHz. The ABE output, sabe, with a doubled sampling rate of 16 kHz, is obtained by adding the up-sampled narrowband signal, slo, and the regenerated highband signal, shi.

Figure 5.2. ABE based on LPC. The input signal, snb, is a narrowband signal with a sampling rate of 8 kHz. The ABE output, sabe, with a doubled sampling rate of 16 kHz, is obtained by adding the up-sampled narrowband signal, slo, and the regenerated highband signal, shi.

Figure 5.3. Extension of the excitation by cosine modulation. A delayed version of the original narrowband signal, snb(n), is added to the modulation output, sM(n), that has been highpass filtered. As a result, a wideband excitation signal, sex(n), is obtained. ΩM is the modulation frequency, and a is selected from {1, 2} so that the power of the excitation signal is correct.

5.3 Extension of the excitation signal

The techniques to create the spectral fine structure for the highband are generally called excitation extension methods. The word excitation signal originates from the source-filter model of the speech production mechanism, and in many ABE algorithms the extension of the excitation refers directly to the LPC residual signal, as shown in figure 5.2. However, the extension of the excitation may also cover spectral widening techniques used by such ABE methods that widen and shape the spectrum without specifically exploiting the source-filter model, as shown in figure 5.1. In addition, the highband excitation can be derived directly from the narrowband signal [68, 69] or more popularly from the narrowband LPC residual [70, 71, 72]. Non-linear processing applied to the narrowband excitation signal, snb(n), is used in many ABE algorithms as a technique to extend the excitation signal [68, 73, 74, 75, 76, 77]. The most popular non-linear functions include a quadratic function, (snb(n))^2, a cubic function, (snb(n))^3, and a full-wave rectifier, |snb(n)|. These non-linear functions produce harmonic distortion components without a pitch estimation. A disadvantage of non-linear functions is that they produce highband excitation having a varying amplitude spectrum. Therefore, some sort of spectral flattening is needed. Excitation extension can also be implemented through time-domain modulation of the narrowband excitation signal, which corresponds to spectral translation and folding methods [78]. Modulation techniques produce shifted copies of the original subband spectrum in the missing frequency range. A block diagram of the modulation with a real-valued cosine function is shown in figure 5.3. A delayed version of the original narrowband signal, snb(n), is added to the modulation output, sM(n), that has been highpass filtered.
The modulation function is of the form

sM(n) = snb(n) · a cos(ΩM n),    (5.1)

where ΩM is the modulation frequency, and a is selected from {1, 2} so that the power of the excitation signal is correct. In spectral translation, the modulation frequency is fixed, and it produces a shifted copy of the original spectrum in the higher frequency range [79, 78]. If the frequency band that is to be copied is from Ωlo to Ωup, the modulation frequency, ΩM, is given as

ΩM = Ωup − Ωlo.    (5.2)

The cut-off frequency of the highpass filter is Ωup. As a result, the output signal, swb(n), is a sum of the original signal and its shifted copy from Ωup to Ωup + ΩM. The extension starts right where the bandwidth of the original signal ends. Spectral folding (or mirroring) is a special case of modulation and corresponds to modulation by the Nyquist frequency, ΩM = π (i.e. 8 kHz in narrowband telephony). The output of spectral folding is a mirror image of the narrowband spectrum in the highband. It has the same effect as up-sampling the signal by two:

swb(n) = snb(n) + snb(n)(−1)^n = snb(n)(1 + (−1)^n).    (5.3)

Spectral folding can be applied to the narrowband signal (for example, in [68, 69]) or to the LPC residual signal (for example, in [70, 80, 81, 71, 82, 72, 83]). Spectral folding produces frequency components almost up to 8 kHz, but on the other hand, there is a spectral gap at around 4 kHz due to the original telephone bandwidth of 300–3400 Hz. The gap could be avoided by folding the spectrum already at 3.4 kHz. However, it has been reported that spectral gaps of moderate bandwidth have an almost inaudible effect on the perception of speech [78]. A disadvantage of the modulation techniques with a fixed frequency is that the harmonic structure is not preserved in the highband. On the other hand, the human ear is not as sensitive to the harmonic structure at high frequencies as it is at low frequencies. In [15], the correction of the highband harmonic structure was found to be unimportant for the perceived quality of ABE-processed speech. Therefore, for example, spectral folding has been a popular choice for excitation extension in ABE methods. The harmonic structure of the excitation signal can be preserved by utilizing an adaptive modulation frequency that is dependent on the current pitch frequency [79]. Pitch-adaptive modulation has been implemented both in the time domain [84, 64] and in the frequency domain [85]. Sinusoidal synthesis has also been used to create a harmonic structure in the missing frequency band [86, 87, 88]. In sinusoidal synthesis, the harmonic structure for the highband is created by a bank of oscillators with frequencies, amplitudes, and phases that are determined from the narrowband speech. No separate spectral flattening is needed, since the spectral envelope can be created directly through the sinusoidal amplitudes. An excitation extension method based on modulated noise is motivated by the fact that the harmonic structure becomes noisier above 4 kHz. Furthermore, in the frequency band above 4 kHz, the resolution of human hearing starts to worsen, and the pitch periodicity is perceived through the time-domain envelope of the bandpass speech signal [89]. Therefore, the excitation can be extended by modulating highband noise with the time-domain envelope of the bandpass (2.5–3 kHz) signal [90, 91, 92].
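The excitation extension techniques of this section can be sketched compactly in Python. The example below implements spectral folding by zero-insertion up-sampling and spectral translation by cosine modulation; the filter orders and exact cut-off frequencies are illustrative assumptions, and the delay compensation and power adjustment of figure 5.3 are omitted for brevity:

import numpy as np
from scipy import signal

FS_WB = 16000  # wideband sampling rate after extension

def fold_excitation(e_nb):
    # Spectral folding: zero-insertion up-sampling of the 8-kHz excitation
    # mirrors the narrowband spectrum into the 4-8 kHz band; a highpass
    # filter keeps only the regenerated highband content.
    e_up = np.zeros(2 * len(e_nb))
    e_up[::2] = e_nb
    sos = signal.butter(6, 4000, btype="highpass", fs=FS_WB, output="sos")
    return signal.sosfilt(sos, e_up)

def translate_excitation(e_nb, f_lo=300.0, f_up=3400.0, a=2.0):
    # Spectral translation: modulate the interpolated excitation with a
    # cosine at f_up - f_lo (cf. eqs. (5.1) and (5.2)) and keep the copy
    # that appears above the original band.
    e_lo = signal.resample_poly(e_nb, up=2, down=1)   # interpolation to 16 kHz
    n = np.arange(len(e_lo))
    f_mod = f_up - f_lo
    s_m = a * e_lo * np.cos(2.0 * np.pi * f_mod / FS_WB * n)
    sos = signal.butter(6, f_up, btype="highpass", fs=FS_WB, output="sos")
    return signal.sosfilt(sos, s_m)

In a complete system, the resulting highband excitation would be added to the delay-compensated, up-sampled narrowband excitation before envelope shaping, as in figure 5.3.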
5.4 Extension of the spectral envelope

The extension of the spectral envelope from the narrowband to the highband is usually made based on some model that maps a narrowband feature vector onto the wideband envelope. The feature vector may consist of features that describe the shape of the envelope directly or of other features that are related to the temporal waveform.

5.4.1 Features

A set of typical features used in ABE methods is given in the following.

Linear predictive coding (LPC) prediction coefficients can be utilized to describe the spectral envelope.

Line spectrum frequency (LSF), also known as line spectrum pair (LSP), is an alternative representation for LPC coefficients [93]. LSF decomposition is used in many speech coding standards as an efficient quantization technique. The LSFs are defined as the roots of the polynomials

P(z) = A(z) + z^(−(p+1)) A(z^(−1))
Q(z) = A(z) − z^(−(p+1)) A(z^(−1)),    (5.4)

where A(z) is the LPC inverse filter.

Cepstral coefficients, c(n), are computed as an inverse discrete Fourier transform of the logarithm of the power spectrum of the signal, s(n) [13]:

c(n) = F^(−1){log |F{s(n)}|}.    (5.5)

Alternatively, cepstral coefficients that are transformed from the LPC coefficients are also often used in ABE methods. Cepstral coefficients [c0, c1, ..., cP] are calculated from the LPC coefficients [a0, a1, ..., aP] using a recursive equation:

c0 = σ^2,
ci = −ai − Σ_{n=1}^{i−1} (n/i) c_n a_{i−n},    (5.6)

where σ is the rms value of the signal that is normalized to 1.

Mel frequency cepstral coefficients (MFCC) are computed as cepstral coefficients, but instead of using a linear frequency scale, a perceptually motivated mel frequency scale is used [94].

Energy-based features include different versions of frame energy. Energy can be calculated from the entire narrowband signal or from subbands. Frame energy for a time-domain frame s(n) with a frame length N is defined as

xe = Σ_{n=0}^{N−1} s(n)^2.    (5.7)

Zero crossing is a traditional feature for voiced/unvoiced clustering. It is calculated from the time-domain frame s(n):

xzc = (1/(N−1)) Σ_{k=1}^{N−1} (1/2) |sign(s(k−1)) − sign(s(k))|,    (5.8)

where N is the frame length, and the sign operation is defined as

sign(x) = +1 if x > 0, 0 if x = 0, and −1 if x < 0.    (5.9)

Gradient index was originally introduced in [95]. It is a feature for describing voiced/unvoiced characteristics of a speech frame s(n):

xgi = (1/10) (Σ_{k=1}^{N−1} Ψ(k) |s(k) − s(k−1)|) / (Σ_{k=0}^{N−1} (s(k))^2),    (5.10)

where N is the frame length, Ψ(k) = (1/2)|ψ(k) − ψ(k−1)|, and ψ is the sign of the gradient s(k) − s(k−1).

Spectral flatness is a frequency-domain feature calculated from the power spectrum |S(e^(jΩi))|:

xsf = log10 [ (Π_{i=0}^{Ni−1} |S(e^(jΩi))|^2)^(1/Ni) / ((1/Ni) Σ_{i=0}^{Ni−1} |S(e^(jΩi))|^2) ],    (5.11)

where Ni is the length of the discrete Fourier transform (DFT), e^(jΩi) is the ith DFT frequency, j being the imaginary unit, and e ≈ 2.718 the base of the natural logarithm [96].

Centroid of the power spectrum is a frequency-domain feature that results in higher values for unvoiced speech than for voiced speech. It is defined as

xsc = (Σ_{i=0}^{Ni/2} f(i) |S(e^(jΩi))|) / ((Ni/2 + 1) Σ_{i=0}^{Ni/2} |S(e^(jΩi))|),    (5.12)

where |S(e^(jΩi))| is the power spectrum, f(i) refers to the frequency in the ith DFT bin, and Ni is the length of the DFT [96, 72].
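Some of the scalar features above can be sketched in a few lines of Python. The implementations below follow eqs. (5.7)–(5.12) up to minor scaling and normalization choices (for instance, the spectral centroid is returned in Hz rather than in the normalized form of eq. (5.12)), and the FFT length and frame conventions are assumptions of this sketch:

import numpy as np

def frame_energy(s):
    return np.sum(s ** 2)                                   # eq. (5.7)

def zero_crossing_rate(s):
    signs = np.sign(s)
    return 0.5 * np.mean(np.abs(np.diff(signs)))            # cf. eq. (5.8)

def gradient_index(s):
    grad = np.diff(s)                                       # s(k) - s(k-1)
    psi = np.sign(grad)
    big_psi = 0.5 * np.abs(np.diff(psi))                    # sign-change indicator
    num = np.sum(big_psi * np.abs(grad[1:]))
    return num / (np.sum(s ** 2) + 1e-12)                   # cf. eq. (5.10), unscaled

def spectral_flatness(s, nfft=256):
    p = np.abs(np.fft.rfft(s, nfft)) ** 2 + 1e-12
    geometric = np.exp(np.mean(np.log(p)))
    arithmetic = np.mean(p)
    return np.log10(geometric / arithmetic)                 # cf. eq. (5.11)

def spectral_centroid(s, fs=8000, nfft=256):
    mag = np.abs(np.fft.rfft(s, nfft))
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    return np.sum(freqs * mag) / (np.sum(mag) + 1e-12)      # centroid in Hz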
5.4.2 Distance measures

Distance measures play an important role in many ABE algorithms. Typically, the distance measure between two spectral envelopes is needed in the training phase of a Gaussian mixture model, codebook, or neural network. In codebook-based methods, a distance measure is also utilized in the selection of a codeword for each frame. Similar measures have also been used in the objective quality evaluation of ABE methods. Most of the spectral distance measures are mean square error measures that are applied either to the spectrum directly or to a parametric representation, such as the LPC envelope. Examples of typical distance measures are the following.

Logarithmic spectral distortion (LSD) is the mean square error in dB:

dLSD = (1/(2π)) ∫_{−π}^{π} ( 20 log10 (σ / |A(e^(jΩ))|) − 20 log10 (σ̂ / |Â(e^(jΩ))|) )^2 dΩ,    (5.13)

where 1/A(e^(jΩ)) is the LPC envelope of the original highband speech, 1/Â(e^(jΩ)) is the LPC envelope of the estimated highband speech, and σ and σ̂ are the respective relative gains.

Cepstral distance is a commonly used distance measure in speech processing that is calculated directly from p cepstral coefficients of the original wideband signal, c, and the estimated wideband signal, ĉ:

dCEPS = Σ_{i=1}^{p} (ci − ĉi)^2.    (5.14)

5.4.3 Codebook mapping

Codebook mapping was one of the first approaches for highband envelope estimation in the ABE field [97, 70, 98, 80, 99, 81, 100, 91, 76]. Put simply, a narrowband codebook consists of a list of codewords, i.e. narrowband feature vectors, and corresponding highband envelopes. The feature vector computed from the input speech is compared with each of the codewords, and the best match is selected. The best match is decided on the basis of an error measure, for example the mean square error, between the input vector and the codebook entries. The basic codebook mapping can be improved with modifications as presented in [99, 100, 91]. Separate codebooks were constructed for unvoiced and voiced fricatives in [100], which improved the performance of the algorithm measured in terms of spectral distortion (SD). Similar results were obtained in [99], where a split codebook for voiced and unvoiced speech sounds improved the performance. In addition, codebook mapping with interpolation, i.e. where the highband envelope is calculated as a weighted sum of the most probable codebook entries, was reported to enhance the extension quality in [99, 91]. Furthermore, in [91], codebook mapping with memory was implemented by interpolating the current envelope estimate with the envelope estimate of the previous frame. Typically, both the narrowband feature vectors and the highband codewords directly represent the spectral envelopes through LPC coefficients ([97, 70, 80]) or LSFs ([98, 100, 91]). MFCCs have also been used in [81]. In [76], a slightly different approach was chosen, where an estimate for the highband energy is first calculated, after which the energy is mapped onto the highband spectral envelope codebook.
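A discretized variant of the log-spectral distortion of eq. (5.13) and the nearest-neighbour codeword selection used in basic codebook mapping can be sketched as follows. The frequency grid, the root-mean-square form of the distortion, and the codebook shapes are assumptions of this sketch, and the codebooks themselves are assumed to have been trained beforehand:

import numpy as np
from scipy import signal

def lpc_log_envelope(a, sigma=1.0, nfft=512):
    # 20*log10 |sigma / A(e^{j Omega})| evaluated on a uniform frequency grid.
    _, h = signal.freqz([sigma], a, worN=nfft)
    return 20.0 * np.log10(np.abs(h) + 1e-12)

def log_spectral_distortion(a_ref, a_est, sigma_ref=1.0, sigma_est=1.0):
    # Root-mean-square difference of the two log envelopes, cf. eq. (5.13).
    env_ref = lpc_log_envelope(a_ref, sigma_ref)
    env_est = lpc_log_envelope(a_est, sigma_est)
    return np.sqrt(np.mean((env_ref - env_est) ** 2))

def select_codeword(x, narrowband_codebook, highband_codebook):
    # Basic codebook mapping: pick the highband envelope whose narrowband
    # codeword is closest to the input feature vector in mean square error.
    errors = np.sum((narrowband_codebook - x) ** 2, axis=1)
    return highband_codebook[np.argmin(errors)]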
5.4.4 Linear mapping

Linear mapping is utilized in [73, 99, 82] to estimate the spectral envelope of the highband. In linear mapping, the input narrowband envelope is characterized by a vector of parameters x = [x1, x2, ..., xn] and the wideband envelope to be estimated by another vector y = [y1, y2, ..., ym]. For example, LPC prediction coefficients or LSFs can be used as input and output parameters. The linear mapping between the input and output parameters is then denoted as

y = Wx,    (5.15)

where the matrix W is obtained through an off-line training procedure with the least-squares approach that minimizes the model error ||y − Wx|| using training data with narrowband envelope parameters X and corresponding highband parameters Y:

W = (X^T X)^(−1) X^T Y.    (5.16)

To better reflect the non-linear relationship between the narrowband and highband envelopes, modifications of the basic linear mapping technique have been presented. Instead of using a single mapping matrix, the mapping can be implemented by several matrices. In [82], speech frames are clustered into four clusters based on the first and the second reflection coefficients, and a separate mapping matrix is created for each cluster. The algorithm in [82] hence utilizes hard-decision clustering, whereas a soft-decision scheme is implemented in [73], where the clustering is performed through vector quantization of the input vector, x, and the final output vector, y, is formed as a weighted sum of the mappings obtained for all clusters. The reported evaluations of ABE methods based on linear mapping are rather concise. In [73], based on an informal listening test conducted in the Japanese language, the quality of narrowband speech was judged better than that of the extended speech. Objective analysis was also provided, showing that the SD for piecewise-linear mapping was smaller than for codebook mapping and neural network approaches. On the other hand, the objective comparison given in [99] indicates better performance for codebook mapping compared to linear mapping.

5.4.5 Gaussian mixture model

In linear mapping, only linear dependencies between the narrowband spectral envelope and the highband envelope are exploited. Non-linear dependencies can be included in the statistical model by utilizing Gaussian mixture models (GMM). A GMM is a parametric model for modelling high-dimensional probability density functions (PDF) of a random variable. In ABE methods, the GMM is typically utilized to model the joint PDF of two random variables, x and y. The GMM PDF for variables x = [x0, ..., x(b−1)] and y = [y0, ..., y(d−1)] is represented as a weighted sum of L Gaussian component densities fG:

fGMM(x, y) = Σ_{l=1}^{L} ρl fG(x, y; μ_{x,y,l}, V_{x,y,l}),    (5.17)

where L is the number of individual Gaussian components, ρl is the lth mixture weight, μ_{x,y,l} is the mean vector, and V_{x,y,l} is the covariance matrix of the lth component. These GMM parameters can be estimated from training data through an iterative training procedure, such as the expectation maximization (EM) algorithm. The GMM is utilized directly in envelope extension to estimate wideband LPC coefficients or LSFs from corresponding narrowband parameters [101, 90]. The performance of the GMM-based spectral envelope extension was later enhanced by using MFCCs instead of LPC coefficients [87, 102]. Furthermore, GMM mapping with memory results in further improvements in terms of LSD and PESQ [103, 104]. In addition to using GMMs directly in envelope extension, they can be used in HMM-based ABE algorithms as well. These methods are discussed in the next section in more detail. The advantage of GMMs in envelope extension methods is that they offer a continuous mapping from narrowband to wideband features compared to the discrete acoustic space resulting from vector quantization. Better results were reported for GMM-based methods compared to codebook mapping in [101] and [90] in terms of SD, cepstral distance, and a paired subjective comparison.
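The linear mapping of eqs. (5.15)–(5.16) in Section 5.4.4 reduces to an ordinary least-squares fit. A minimal sketch is shown below; it is written in the row-vector convention, so the matrix corresponds to the transpose of W in eq. (5.15), and the dimensions and random data are purely hypothetical:

import numpy as np

def train_linear_mapping(X, Y):
    # Least-squares solution of min ||Y - X W||; lstsq is used instead of
    # forming (X^T X)^(-1) X^T Y explicitly for numerical robustness.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W                      # shape: (n_input_features, n_output_params)

def apply_linear_mapping(W, x):
    return x @ W                  # estimated highband envelope parameters

# Hypothetical dimensions: 10 narrowband features, 8 highband LSF parameters.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))   # training narrowband feature vectors
Y = rng.standard_normal((1000, 8))    # corresponding highband parameters
W = train_linear_mapping(X, Y)
y_hat = apply_linear_mapping(W, X[0])

Cluster-wise mappings, as in [82] and [73], would simply train one such matrix per cluster and combine the outputs with hard or soft weights.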
Figure 5.4. Hidden Markov model with five states. The transition probabilities from state i to state j are denoted as aij. An observation ot at time instant t is created according to the output probability densities of state i, denoted as bi.

5.4.6 Hidden Markov model

Hidden Markov models (HMMs) are statistical models that have been utilized successfully, for example, in speech recognition [105]. An HMM consists of two stochastic processes. The first process is a Markov chain with N finite states. This process is not directly observable but is hidden, and it can be observed only through another stochastic process that produces the sequence of observations. A simple HMM structure with only left-to-right transitions between the five states is shown in figure 5.4. At each discrete time instant t, a decision for the next state of the Markov chain is made based on the transition probabilities from state i to state j, denoted as aij. After the next state is determined, an observation ot is created according to the output probability densities of state i, denoted as bi. HMMs have been utilized in the envelope prediction of ABE methods [106, 78, 107]. In [78], each state of the HMM represents a typical highband spectral envelope. During the bandwidth extension, the narrowband signal frames are mapped onto the states of the HMM, and the parameters representing the highband envelope are determined using an estimation rule. The narrowband feature vector in [78] consists of both parameters representing the envelope (LPC coefficients) and scalar features (such as the zero-crossing rate, normalized frame energy, gradient index, local kurtosis, and spectral centroid) that contain voiced/unvoiced information. The highband envelope, on the other hand, is represented by cepstral coefficients. The parameters of the Markov chain with states representing the spectral envelopes are obtained by vector quantizing the highband spectral envelopes using training data with true wideband speech. As a result, every state of the HMM corresponds to one entry of the VQ codebook. The resulting HMM structure is ergodic, indicating that a transition from any state to any other state is possible. Furthermore, for each state Si, a statistical model is constructed based on the narrowband features x and the highband spectral envelopes y. The statistical model consists of the state and transition probabilities P(Si) and P(Si(m + 1)|Sj(m)), respectively, the observation probability p(x|Si), and the emission probability p(y|Si). The observation probability, i.e. the PDF of the feature vectors x, for each state p(x|Si) is estimated by a GMM using the EM training procedure. Finally, the estimation rule defines how the coefficients representing the spectral envelope of the missing highband are formed. In [78], a minimum mean square error (MMSE) rule is applied, and the resulting highband envelope is the sum of individual codebook entries weighted by the a posteriori probabilities of the corresponding states of the HMM.
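The MMSE estimation rule described above, in which the highband envelope is a weighted sum of codebook entries, can be sketched as follows. The posterior probabilities are assumed to be given (computed, for example, with an HMM forward recursion or as GMM responsibilities), and the dimensions are hypothetical:

import numpy as np

def mmse_envelope_estimate(posteriors, highband_codebook):
    # posteriors: shape (n_states,), the a posteriori state probabilities
    # highband_codebook: shape (n_states, envelope_dim), one entry per state
    posteriors = np.asarray(posteriors, dtype=float)
    posteriors = posteriors / posteriors.sum()      # ensure they sum to one
    return posteriors @ highband_codebook           # posterior-weighted sum

# Hypothetical example: 8 states, 10 cepstral coefficients per codebook entry.
rng = np.random.default_rng(1)
codebook = rng.standard_normal((8, 10))
posteriors = rng.random(8)
y_hat = mmse_envelope_estimate(posteriors, codebook)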
An advantage of the algorithm presented in [78] is that, due to the HMM and the MMSE estimation rule, the algorithm is not memoryless, unlike most ABE methods, but information from the preceding frames is exploited. Also, the foundation of the algorithm is strongly based on a widely used and accepted mathematical model that has been successfully exploited in other fields of speech processing. On the other hand, the method is a "black box" implementation, and the performance is dependent on the feature selection and training.

5.4.7 Neural networks

Neural networks have been utilized in the estimation of the highband parameters, e.g. in [108, 109, 110, 69, 72]. A multilayer feedforward neural network (FFNN) is the most common neural network, and it was also used in [75, 109, 110, 69, 72]. It comprises an input layer of neurons, hidden layers, and an output layer of neurons. An example FFNN with one hidden layer is shown in figure 5.5. A hidden neuron has m input signals xk and an output vj. Each input has a synaptic weight wjk. An adder sums the weighted input signals and an optional bias, bj, and finally the output of the neuron, vj, is defined by an activation function ϕ(·) as follows:

vj = ϕ( Σ_{k=1}^{m} wjk xk + bj ).    (5.18)

Figure 5.5. Feedforward neural network with one hidden layer between the input and output layers. The input layer comprises m input signals xk, the hidden layer o hidden neurons vj, and the output layer n output signals yi.

If every node in each layer is connected to every other node in the adjacent forward layer, the neural network is said to be fully connected; otherwise it is partially connected. Furthermore, a basic FFNN can be modified by adding one or more feedback loops, i.e. by feeding output signals back to the inputs of the neurons. Such neural networks are called recurrent neural networks [111]. The parameters of a neural network are determined through a learning (or training) process, where the connection weights are adjusted iteratively. Usually the learning process is based on an example set of input data for which the network learns to produce the desired output and, at the same time, to generalize the results to other inputs. Learning methods are often divided into three groups: 1) supervised learning, 2) unsupervised learning, and 3) reinforcement learning. In supervised learning, the desired behaviour of the neural network is described in terms of a set of input-output examples. An iterative learning process is implemented by minimizing an error function, i.e. a measure of the difference between the output of the network and the desired output, over the whole training data. In unsupervised learning, desired outputs are not available, and the learning is based on the correlations of the input data [112]. A typical use of unsupervised learning is clustering, where the system forms clusters based on similarities in the input data. In reinforcement learning, the desired output of the network is also unknown, but the performance of the network can be measured by a so-called fitness function.

Figure 5.6. Schematic diagram of neuroevolution methods. Each genotype is tested by decoding it into a neural network and performing the task, resulting in a fitness value for the genotype. After evaluating all the genotypes in this manner, a new generation of genotypes is created by genetic operations among the best genotypes. This process is continued until the fitness is sufficiently high.
The learning process is therefore reinforced by crediting a desired behaviour of the network and penalizing an undesired behaviour. Neuroevolution methods are a special group of reinforcement learning methods that can be used not only for modifying neural network weights but also their topologies [113]. They are also well suited for recurrent networks. The basic idea behind most of the neuroevolution methods is shown in figure 5.6. First, an initial generation of the genotype population is created. Each genotype is then tested by decoding it into a neural network and performing the task, resulting in a fitness value for the genotype. After all the genotypes are evaluated in this manner, a new generation of genotypes is created by genetic operations among the best genotypes. This process is continued until the fitness is sufficiently high.

A three-layer feed-forward neural network is used in [69] to determine weighting parameters of a folded narrowband spectrum in the highband (4–8 kHz), quite similarly to Publication II. Instead of utilizing neuroevolution learning, as in Publication II, a supervised learning method called the Levenberg-Marquardt algorithm is used to train the network. The input to the neural network is a feature vector of nine narrowband features. The neural network is trained to output weights for critical bands in the highband. The shaping of the folded highband is implemented in the spectral domain by spline functions that are constructed around the critical band weights given by the neural network. The method was tested against both pure wideband and narrowband references using a CCR test. The results showed a clear preference for ABE samples over narrowband samples, but wideband samples were also reported to be superior to ABE samples. Both naive and non-naive listeners were used in the test. In addition to the quality test, the MRT was also utilized to compare the intelligibility of narrowband and ABE samples. The reported results showed some improvement also in terms of intelligibility. Another approach based on neural networks is introduced in [109]. The method closely follows the LPC-based algorithm model shown in figure 5.2. The mapping of the narrowband cepstral coefficients derived from the LPC coefficients onto wideband coefficients is implemented by an FFNN. A simple supervised learning rule called back-propagation is used. In addition to the neural network implementation, an alternative implementation with linear mapping of the envelope parameters is presented, and the two methods are compared with each other. Both objective and subjective comparison tests reported in [109] indicate nearly the same quality improvement for both the neural network and the linear mapping-based ABE versions compared to a narrowband reference. In [75], a neural network-based ABE algorithm was compared with a codebook-based method. The neural network topology was rather similar to the one presented in [109]. A multilayer perceptron (MLP) network with three layers and the back-propagation learning algorithm was used. Both input and output parameters for the neural network were LPC coefficients. The two methods were evaluated with both objective and subjective methods.
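For completeness, the forward computation of the fully connected feedforward network defined in eq. (5.18) above can be sketched as follows; the layer sizes, the tanh activation, and the random weights are placeholder assumptions, since in practice the weights come from a training procedure such as back-propagation or neuroevolution:

import numpy as np

def ffnn_forward(x, W_hidden, b_hidden, W_out, b_out):
    v = np.tanh(W_hidden @ x + b_hidden)    # hidden activations, cf. eq. (5.18)
    return W_out @ v + b_out                # linear output layer

# Hypothetical sizes: 9 input features, 12 hidden neurons, 5 output weights.
rng = np.random.default_rng(2)
m, o, n = 9, 12, 5
x = rng.standard_normal(m)
y = ffnn_forward(x,
                 rng.standard_normal((o, m)), rng.standard_normal(o),
                 rng.standard_normal((n, o)), rng.standard_normal(n))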
A three-layer feed-forward neural network is used in [69] to determine weighting parameters of a folded narrowband spectrum in the highband (4–8 kHz), quite similarly to Publication II. Instead of utilizing neuroevolution learning, as in Publication II, a supervised learning method, the Levenberg-Marquardt algorithm, is used to train the network. The input to the neural network is a feature vector of nine narrowband features. The neural network is trained to output weights for critical bands in the highband. The shaping of the folded highband is implemented in the spectral domain by spline functions that are constructed around the critical band weights given by the neural network. The method was tested against both pure wideband and narrowband references using a CCR test. The results showed a clear preference for ABE samples over narrowband, but wideband was also reported to be superior to the ABE samples. Both naive and non-naive listeners were used in the test. In addition to the quality test, an MRT was also utilized to evaluate intelligibility between narrowband and ABE samples. The reported results showed some improvement also in terms of intelligibility.

Another approach based on neural networks is introduced in [109]. The method closely follows the LPC-based algorithm model shown in figure 5.2. The mapping of the narrowband cepstral coefficients derived from the LPC coefficients onto wideband coefficients is implemented by an FFNN. A simple supervised learning rule called back-propagation is used. In addition, in order to assess the use of a neural network for envelope extension, an alternative implementation with linear mapping of the envelope parameters is presented, and the two methods are compared to each other. Both objective and subjective comparison tests reported in [109] indicate nearly the same quality improvement for both the neural network and the linear mapping-based ABE versions compared to a narrowband reference.

In [75], a neural network-based ABE algorithm was compared with a codebook-based method. The neural network topology was rather similar to the one presented in [109]. A multilayer perceptron (MLP) network with three layers and the back-propagation learning algorithm was used. Both the input and output parameters of the neural network were LPC coefficients. The two methods were evaluated with both objective and subjective methods. Interestingly, the results from the objective and subjective tests were inconsistent; three of four objective measures indicated better performance for the neural network algorithm. However, the MOS test resulted in a clear preference for codebook mapping.

In [72], an ABE algorithm is studied that uses a neural network to estimate the mel spectrum for the highband. (This technique is essentially the same as the ABE2 algorithm in Publication VII.) In this method, the highband is divided into subbands that, in turn, are weighted to realize the mel spectrum. The method uses a neuroevolution algorithm, neuroevolution of augmenting topologies (NEAT), to train the neural network. The training starts from a minimum topology of the network that evolves during the training process. An advantage of the NEAT algorithm is that it can exploit recurrent connections and thus include memory within the network. According to the CCR listening test results reported in [72], the method improves narrowband AMR-coded speech significantly.

Despite their many advantages, neural networks also have disadvantages. In practice, there is an unlimited number of possible network topologies, training algorithms, and training data sets. Using a large set of training data is recommended, which easily leads to a slow development process for the algorithm. Both promising and unpromising quality evaluation results were reported in [108, 75, 109, 69, 72]. The comparisons of neural networks with linear mapping and with codebooks [75, 109] indicate that the performance of a simple neural network is perhaps not good enough, but with more complex structures and with perceptual fitness functions, it is possible to obtain good results [72].

6. Artificial bandwidth extension in mobile devices

Improved speech quality and intelligibility, achieved through a carefully designed and implemented ABE algorithm, can be exploited in products that receive or store narrowband speech. For example, cellular telephony, car telephony, cellular networks, VoIP telephony, and teleconferencing systems are potential products that benefit from ABE methods. The quality experienced by the end user depends on the whole speech processing chain from the far-end user to the near-end user, including the acoustic properties of the user terminals. Therefore, when implementing an ABE method in a product, the properties of the particular product should be taken into account. In addition, whether or not the product, and consequently the ABE method, is used in noisy places influences the quality expectations placed on the ABE method.

In a telephone conversation, intelligibility comes before audio quality, but after excellent intelligibility is achieved, the importance of quality increases. For example, in an extremely noisy environment, it is most important to understand the message that the conversation partner is delivering and to be able to respond to it. On the other hand, when speaking on the phone in a quiet place, the naturalness of the voice is valued by the end user, and being able to recognize the other person on the phone easily is appreciated. In this chapter, the implementation of an ABE method in mobile devices and in car telephony is discussed in more detail.
6.1 Artificial bandwidth extension in a mobile device

The fact that most of the users of the world's 5.6 billion mobile connections [34] are still offered only narrowband speech makes mobile devices an attractive target for an ABE method. During the transition phase from narrowband to wideband speech communication, several phone call scenarios with respect to the speech bandwidth exist, as illustrated in figure 6.1. If either of the devices in a telephone conversation supports only narrowband coding, speech is transmitted in narrowband, as shown in figures 6.1 a–d. Only if both devices support wideband speech coding is true wideband speech achieved, as illustrated in figure 6.1 e.

ABE can be implemented in both narrowband and wideband terminals. In practice, wideband acoustical design is needed in both cases. However, whether the device is a narrowband or a wideband terminal has an influence on the type approval tests that are performed for each mobile device, following the 3GPP specifications [114]. For a narrowband mobile terminal, passing the narrowband performance requirements is sufficient, whereas a wideband terminal should meet the wideband telephony performance requirements.

Several issues should be taken into account when implementing an ABE algorithm in a mobile device. The algorithm has to be suitable for real-time implementation with a reasonable computational load. The algorithmic delay should be small enough so that the overall downlink processing delay does not exceed 150 ms, because greater delays start to annoy the end users and make the conversation difficult.

There are thousands of languages in the world. The most widely spoken languages, each spoken by over 100 million people, include Chinese, Hindi, English, Spanish, Arabic, Portuguese, French, Bengali, Russian, Japanese, and German [115]. The languages differ from each other in temporal and spectral characteristics, and the ABE method should be as language independent as possible. The algorithm should also be speaker independent and result in good speech quality whether the speaker is male or female, an adult or a child. The far-end user might have a speaking disorder, a quiet voice, or a loud voice. Speaker-dependent methods have been reported to outperform speaker-independent methods, at least in [78]. Perhaps in the future, the fact that a mobile device is often personal could provide an opportunity to develop an architecture where the algorithm adapts to the particular speech characteristics of the user.

Figure 6.1. Different phone call scenarios between narrowband and wideband mobile devices. If both the far-end and the near-end terminals are narrowband devices, the speech signal is narrowband as well (a). True wideband is achieved only if both terminals and the network support wideband (e). Otherwise, the speech signal is narrowband, and ABE can be used to widen the spectrum (b, c, and d).
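The decision implied by the scenarios of figure 6.1 can be summarized by a small sketch. The function below is purely illustrative (it is not part of any cited implementation) and simply reports the effective receive bandwidth given the wideband capabilities of the two terminals and the network.

```python
def receive_mode(far_end_wb: bool, network_wb: bool, near_end_wb: bool,
                 abe_available: bool) -> str:
    """Effective receive bandwidth for the call scenarios of figure 6.1.

    True wideband requires wideband support in both terminals and in the
    network; otherwise the call is narrowband, and ABE (if implemented in
    the near-end terminal) can be used to widen the received speech.
    """
    if far_end_wb and network_wb and near_end_wb:
        return "wideband"
    return "narrowband + ABE" if abe_available else "narrowband"
```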
Mobile devices are used everywhere in the world: in homes, in offices, in outdoor activities, in cars, in concerts, in discos, at construction sites, and so on. Therefore, the ABE method should be robust against speech signals with all kinds of background noise, at both high and low SNR. The far-end user may be surrounded by any kind of background noise, which is included in the signal that serves as the input to the ABE method. The ABE method should be able to enhance the speech quality without increasing the noise level. Babble noise, cafeteria noise, and any other noise with high-frequency content are extremely challenging for ABE methods, because the methods easily start to enhance these high-frequency components. Background noise around the near-end user, on the other hand, masks possible artefacts introduced by an ABE method. Therefore, the ABE method is especially useful in mobile phones, which are often used in noisy places.

Different speech codecs and bit rates with varying speech quality are used around the world. In addition to background noise, codec distortion deteriorates the quality of the narrowband input signal, and the performance of an ABE method should be assessed with distorted input signals of different kinds.

6.1.1 Signal path in mobile telephony

The voice quality in a mobile device is affected by the whole chain from the far-end user to the near-end user. The acoustic environment and the background noise conditions around the far-end user have an effect on the quality of the input signal that is captured by the far-end telephone device, which may be a handportable mobile device, a landline telephone, a car hands-free system, or a VoIP application. Taking a mobile device as an example, the microphone of the far-end user's device captures the speech signal, which is then passed through a low-pass filter that removes unwanted signal components and through an analogue-to-digital conversion. The far-end signal path continues with speech enhancement algorithms, such as noise suppression, echo cancellation, and level control, that are implemented in the mobile device. The digital signal is then coded by a speech codec, such as the AMR codec, and transmitted through the cellular network. In the near-end device, the received signal is first decoded. The resulting digital signal is processed through speech enhancement algorithms such as noise suppression and dynamic level control. ABE processing is typically implemented after the other speech enhancement algorithms, and the influence of the other algorithms on the ABE quality, especially noise suppression and dynamic range control (compression), should be assessed. After ABE processing, the frequency response of the speech signal is equalized for the transducer. Finally, there is a digital-to-analogue converter before the amplifier and the transducer.
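The receive-path ordering described above can be summarized with the following sketch. All processing stages are hypothetical placeholders for the device's own implementations; the point here is only the order in which they are applied, with ABE after the other speech enhancement algorithms and before the transducer equalization.

```python
def downlink_path(coded_frames, decode, noise_suppress, level_control,
                  bandwidth_extend, equalize):
    """Near-end downlink processing chain sketched after section 6.1.1.

    Each argument after coded_frames is a callable operating on one frame
    of speech samples; all of them are placeholders, not real APIs.
    """
    for coded in coded_frames:
        x = decode(coded)          # speech decoder (e.g. AMR)
        x = noise_suppress(x)      # receive-side noise suppression
        x = level_control(x)       # dynamic level control / compression
        x = bandwidth_extend(x)    # artificial bandwidth extension
        x = equalize(x)            # equalization for the earpiece response
        yield x                    # to D/A conversion, amplifier, and transducer
```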
Figure 6.2. Receiving frequency mask as specified in [114].

6.1.2 Acoustic design of a mobile terminal

The acoustic design of a mobile terminal comprises the hardware components and their mounting, i.e. how they are placed in the device. The requirements for good ABE quality are basically the same as for good wideband speech, i.e. the earpiece should be able to reproduce a frequency band of 50–7000 Hz with reasonable earpiece distortion. Additionally, equalizers are required to compensate for the non-ideal frequency responses of the hardware components, in particular the microphone and the loudspeaker. For mobile devices, frequency masks are specified for both the receiving and the sending quality, and they have to be met in the handportable mode [114]. Requirements are specified for both narrowband and wideband telephony.

An ABE method can be implemented in a narrowband device or in a wideband device. In a narrowband device, only narrowband speech transmission is supported, and the ABE method is regarded as a narrowband speech enhancement feature. In a wideband device, wideband speech coding is supported, and ABE is used if true wideband speech is not available. The receiving wideband frequency mask, specified in the 3GPP specification 26.131 [114], is shown in figure 6.2. The frequency mask is relatively loose, which is why an ABE method should always be tested with the acoustics of the device. For example, if the ABE algorithm in use produces a spectral gap around 4 kHz, it might not be audible with a flat frequency response, but a non-flat response might boost the gap and thus increase audible distortion. The quality of state-of-the-art ABE algorithms is not completely comparable with true wideband speech. There are some artefacts and distortion in the extended signals, but adequate ABE quality in a product can usually be achieved by finding a good balance between the original narrowband signal and the regenerated highband.

6.2 Artificial bandwidth extension in car telephony

In many countries, hands-free operation of a cellular phone is mandatory while driving a car, for road safety reasons. Hands-free usage improves road safety since the driver can better concentrate on driving without having to hold the mobile device to the ear with a hand, or even with a shoulder, if both hands are needed for some driving manoeuvre. Hands-free usage of a mobile phone in a car can be achieved by, for example, utilizing the speaker mode of a mobile device, using a wired headset, using a Bluetooth accessory, or utilizing an integrated in-car hands-free system.

Car telephony is an attractive target for ABE methods, because background noise degrades speech intelligibility and increases listening fatigue. In addition, the noise in a car interior typically has low-pass characteristics and therefore masks possible artefacts generated by ABE processing. In [116], the usage of a bandwidth extension algorithm improved the intelligibility of meaningless vowel-consonant-vowel syllables in narrowband speech in a car environment. Some quality improvement was also reported in low (0 dB) SNR conditions. Furthermore, when utilizing ABE in car telephony through an integrated hands-free system (for example, [117]), the acoustic environment around the near-end speaker is always known, and the whole signal processing path can be designed for car usage, such as in [118].

7. Summary of the publications

This thesis consists of seven articles, four of which were published in peer-reviewed international journals and three in full-paper-reviewed conferences. All the articles focus on ABE of speech signals.

Publication I: "Artificial bandwidth expansion method to improve intelligibility and quality of AMR-coded narrowband speech"

The first publication is a conference article that presents an ABE algorithm. The method is based on spectral folding, i.e., the narrowband spectrum from the frequency range 0–4 kHz is folded onto the highband (4–8 kHz). The folded spectrum is shaped in the FFT domain by spline functions that are optimized using a genetic algorithm. The performance of the algorithm is evaluated by an intelligibility test called SRT and by an ACR test that is often used to assess speech codecs. The results from the SRT test indicate that the algorithm improves speech intelligibility in noise. The ACR test results show no statistically significant improvement in speech quality compared to narrowband AMR speech. The performance of the algorithm is further evaluated in Publication III.
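As an illustration of the spectral folding step described above for Publication I, the sketch below shows folding in its simplest form: inserting a zero between consecutive narrowband samples doubles the sampling rate and mirrors the 0–4 kHz spectrum into the 4–8 kHz band. This is only the folding itself; in the publication the folded band is additionally shaped by spline functions optimized with a genetic algorithm, and the lowband is taken from the original narrowband signal.

```python
import numpy as np

def spectral_fold(nb_frame):
    """Zero-insertion upsampling of an 8 kHz narrowband frame to 16 kHz.

    The upsampled signal contains a mirror image of the 0-4 kHz narrowband
    spectrum in the 4-8 kHz band; this folded highband is the raw material
    that an ABE method then shapes and mixes with the original lowband.
    """
    folded = np.zeros(2 * len(nb_frame), dtype=float)
    folded[::2] = nb_frame
    return folded

# Quick check: the magnitude spectrum of the folded frame is symmetric around 4 kHz.
frame = np.random.randn(160)                        # one 20 ms frame at 8 kHz
spectrum = np.abs(np.fft.rfft(spectral_fold(frame)))
assert np.allclose(spectrum, spectrum[::-1])
```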
Publication II: "Neural network-based artificial bandwidth expansion of speech" An ABE algorithm utilizing a neural network is presented in the article. The algorithm, called neuroevolution ABE (NEABE), is based on spectral folding. A feature vector is computed from the narrowband signal and serves as an input to a neural network. As an output, the neural network provides parameters required to formulate a spline function that, in turn, is used to shape the highband spectrum. A neuroevolution algorithm is 71 Summary of the publications English 3 Russian 2 2 2 WB Chinese 3 3 WB 1 1 1 0 0 0 ABE -1 NB ABE -1 -1 WB ABE NB NB -2 -2 -2 -3 -3 -3 Figure 7.1. Main results of the CCR language test presented in the order of preference — wideband (WB), ABE, and narrowband (NB) processings — in three languages. Mean scores are indicated by dots and 95 % confidence intervals are shown with error bars. used to train the neural network. The algorithm is evaluated against a clean narrowband reference by a paired comparison test and with a CCR test. The tests indicate a clear preference towards NEABE-processed speech compared to the clean narrowband reference. In addition, an objective measure, the SD, between NEABE outputs and true wideband signals are computed for phonetically labelled speech data. Publication III: "Evaluation of an artificial speech bandwidth extension method in three languages" In this journal article, a potential language dependency of an ABE algorithm is studied. The algorithm is an enhanced version of the method presented in Publication I, and it is evaluated against an AMR-coded narrowband reference and an AMR-WB-coded wideband reference. Three CCR tests are arranged in English, Russian and Mandarin Chinese. All the languages selected for the evaluation are spoken by millions of people around the world. Mandarin Chinese is chosen as one of the test languages because it is a tonal language. Furthermore, Russian is chosen because of its varied set of fricative sounds that are known to be challenging for ABE methods. The test results are consistent in all three languages, indicating that the algorithm is not language dependent. The main results of three CCR tests are shown in figure 7.1. 72 Summary of the publications Figure 7.2. Signal path from the far-end user to the near-end user in a telephone conversation between two mobile phone users. Publication IV: "Development, evaluation and implementation of an artificial bandwidth extension method of telephone speech in mobile terminal" This article discusses the development cycle of an ABE method from an initial idea to the implementation in a mobile terminal. In addition to the algorithm development, the process includes several subjective tests and simulations that verify the proper performance of the algorithm in different use scenarios. The article also discusses the utilization of the ABE technology in a mobile device. The signal path from the far-end user to the near-end is illustrated in figure 7.2. The sending and receiving mobile terminals have certain characteristics including the influence of speech coding, other speech enhancement algorithms, and acoustical design of the devices. In addition, background noise at both ends of the communication affects the speech quality experienced by the end user. Publication V: "Binaural artificial bandwidth extension (B-ABE) for speech" In this conference article, the requirements for the ABE algorithm to extend binaural signals are discussed. 
An ABE algorithm originally designed for monaural signals is modified for binaural speech signals so that the algorithm preserves the localization information of the speakers. A subjective listening test is also organized to analyse the performance of the method. The test results indicate that the algorithm preserves the localization information well. Binaural speech technology can benefit from spatial audio, and if wideband speech is not available, ABE can be exploited. An example teleconferencing system with 3D processing implemented in a conference bridge and B-ABE in a terminal is shown in figure 7.3.

Figure 7.3. Teleconferencing system including a conference bridge and a terminal with the B-ABE function. 3D processing is performed in the conference bridge, and the binaural signal is sent to the terminal, where B-ABE processing takes place.

Publication VI: "Evaluating artificial bandwidth extension by conversational tests in car using mobile devices with integrated hands-free functionality"

This article reports conversational test results for ABE processing. In the test, phone calls between two persons are made using mobile devices over a cellular network. Three connection types are involved in the test: a narrowband connection with AMR coding, a wideband connection with AMR-WB coding, and an ABE connection with AMR coding and ABE processing implemented in the terminal. The test subjects held short conversations in a car with and without simulated car interior noise. The subject in the car was able to switch between two different connections during each phone call, and after the conversation, the subject was asked which connection was better. The placement of the mobile devices in the car is shown in figure 7.4. The results show a clear preference for the ABE connection compared to the narrowband one, whereas wideband was preferred over both narrowband and ABE. To the best of the author's knowledge, this article is the first to report conversational testing of an ABE method. In addition, for the first time, an ABE method is evaluated running on a true end product and using real cellular networks.

Figure 7.4. Mobile devices in the test car.

Publication VII: "Conversational quality evaluation of artificial bandwidth extension of telephone speech"

The last article discusses conversational quality testing of ABE more thoroughly. An extensive conversational evaluation consisting of two separate tests was organized. Two different ABE algorithms, a narrowband reference (AMR), and a wideband reference (AMR-WB) were included in the tests. The first test closely followed the ITU-T P.805 recommendation, whereas the second test was a modification of the first and enabled a paired comparison and asymmetric conversation tasks. A schematic illustration of the test setup is shown in figure 7.5. The results indicate that speech processed with one of the ABE methods is preferred to narrowband speech in noise. However, wideband speech is superior to both narrowband and ABE-processed speech.

Figure 7.5. Schematic illustration of the conversation test setup: two subjects in separate rooms converse over a simulated telephone connection.

8. Conclusions

Narrowband speech coding, which is deployed in many telephone applications, limits the speech bandwidth to the frequency range 300–3400 Hz, resulting in degraded speech quality and intelligibility.
When only narrowband speech transmission is available, ABE can be applied to the signal at the receiving end of the communication link. An ABE method aims to improve speech quality and intelligibility by regenerating new frequency components above 3400 Hz in the signal. Since the extension is made without any transmitted information, the algorithm is suitable for any telephone application that is able to reproduce wideband audio signals.

In this thesis, ABE towards high frequencies is addressed from the mobile communication perspective. The following issues are addressed:

1. ABE algorithm development for narrowband speech signals.
2. ABE for binaural signals.
3. Evaluation of the ABE methods through subjective listening tests and conversational tests.
4. Implementation of an ABE method in a mobile device.

The developed ABE algorithms are presented in Publication I (with enhanced algorithms in Publication III and Publication VII), Publication II, and Publication VII. All three methods are suitable for real-time implementation with a reasonable computational load and algorithmic delay. The methods are primarily designed for monaural telephone speech signals, whereas the extension of binaural signals is addressed in Publication V.

For evaluating the developed ABE methods, several listening-opinion tests have been conducted. The quality of ABE-processed signals is compared with both narrowband and wideband references through subjective listening tests in laboratory conditions. The selected test methods include ACR (in Publication I and Publication IV), CCR (in Publication II and Publication III), and paired comparison (in Publication II). In addition, the CCR test reported in Publication III addresses language dependency issues of an ABE method. Furthermore, an SRT-in-noise test was utilized to assess how ABE affects speech intelligibility in noise (in Publication I and Publication IV). In Publication VI and Publication VII, the ABE quality is assessed by conversational tests, where the algorithms are evaluated in a more natural speech communication situation between mobile phone users. The implementation in a mobile phone of the ABE algorithm presented in Publication I, Publication III, and Publication IV is discussed in Publication IV.

A thorough subjective evaluation verifies that narrowband speech quality and intelligibility can be improved by the novel ABE algorithms developed in this thesis. The results from the tests are consistent and indicate no language dependency. Furthermore, the results have been obtained with realistic speech data, i.e. coded speech from several speakers and languages has been used in the tests. The results from the SRT test and the conversational test reported in Publication VI indicate that ABE improves quality and intelligibility especially in a noisy environment, where the artefacts are not heard due to the masking effect. However, the larger-scale conversational test in Publication VII does not completely verify this result, even though its results indicate that ambient noise increased the effort needed to understand the conversation partner, and one of the ABE methods reduced the effort needed to understand female voices compared with narrowband AMR.

ABE for binaural signals has to be implemented such that the binaural cues are preserved. In particular, the ILD is an important cue at high frequencies. In Publication V, the ABE processing improved the 3D localization compared to the narrowband signal.
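To make this requirement concrete, the sketch below computes a frame-wise highband ILD from a left/right signal pair; a binaural ABE method should keep this quantity close to that of the original wideband signal. The band edge, frame length, and the simple FFT-based estimation are illustrative choices, not the analysis used in Publication V.

```python
import numpy as np

def highband_ild(left, right, fs=16000, f_lo=4000, frame=320):
    """Frame-wise interaural level difference (dB) in the band above f_lo."""
    k_lo = int(f_lo * frame / fs)                  # first FFT bin of the highband
    ilds = []
    for i in range(0, min(len(left), len(right)) - frame + 1, frame):
        l_spec = np.fft.rfft(left[i:i + frame])
        r_spec = np.fft.rfft(right[i:i + frame])
        e_left = np.sum(np.abs(l_spec[k_lo:]) ** 2) + 1e-12
        e_right = np.sum(np.abs(r_spec[k_lo:]) ** 2) + 1e-12
        ilds.append(10.0 * np.log10(e_left / e_right))
    return np.array(ilds)
```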
In a mobile phone, the performance of an ABE method depends on the entire processing chain of the downlink speech signal. If the equalized frequency response of the earpiece is not sufficiently flat, some artefacts of the ABE processing may be emphasized. A possibility to tune the algorithm separately for each acoustical design is therefore beneficial. If the acoustical properties of a mobile device are not optimal for an ABE method, uplink-noise-dependent tuning can be utilized.

Three different ABE algorithms have been developed and evaluated in this thesis. The results from the tests are consistent, indicating that even though the extension is completely artificial, speech quality and intelligibility are, in most cases, improved. However, there is still a gap between the quality of ABE-processed speech and true wideband speech. In the future, it would be interesting to study whether the quality gap could be further decreased by bandwidth extension towards low frequencies. Moreover, the small-scale conversational test in Publication VI is the only test to date where ABE was evaluated using true mobile devices and cellular connections. The performance of the algorithm was carefully tuned for the particular acoustical design. Although the results are promising, more in-depth field studies would be beneficial to find out whether end users adapt to the artefacts introduced by the ABE method or whether the artefacts start to annoy them. In addition, based on this study, an open research question is how telephone users experience handovers from narrowband to wideband speech and vice versa, compared to handovers from wideband to ABE and vice versa.

Bibliography

[1] International Telecommunication Union. ITU-T Recommendation G.712, Transmission performance characteristics of pulse code modulation channels, 2001. [2] B. C. J. Moore and C.-T. Tan. Perceived naturalness of spectrally distorted speech and music. The Journal of the Acoustical Society of America, 114(1):408–419, 2003. [3] H. Fletcher and R. H. Galt. The perception of speech and its relation to telephony. The Journal of the Acoustical Society of America, 22(2):89–151, 1950. [4] The European Telecommunications Standards Institute (ETSI). Digital cellular telecommunications system (Phase 2); Enhanced Full Rate (EFR) speech processing functions; General description (GSM 06.51 version 4.1.1), 2000. [5] 3rd Generation Partnership Project (3GPP). Adaptive multi-rate (AMR) speech codec, transcoding functions, 3GPP TS 26.090, version 10.0.0., 2011. [6] 3rd Generation Partnership Project (3GPP). Adaptive multi-rate - wideband (AMR-WB) speech codec, transcoding functions, 3GPP TS 26.190, version 6.1.1., 2005. [7] Orange launches world’s first high-definition voice service for mobile phones in Moldova. http://event.orange.com/media/hd_voice_en/UPL8498801975512936614_CP_Orange_Moldova_HDVoice_en.pdf, 2009. [accessed 26-July-2012]. [8] SILK: Super wideband audio codec. https://developer.skype.com/silk. [accessed 17-June-2012]. [9] B. Geiser, P. Jax, and P. Vary. Artificial bandwidth extension of speech supported by watermark-transmitted side information. In Proceedings of Interspeech, pages 1497–1500, Lisboa, Portugal, 2005. [10] A. Sagi and D. Malah. Bandwidth extension of telephone speech aided by data embedding. EURASIP Journal on Advances in Signal Processing, 2007(1), 2007. [11] S. Chen and H. Leung. Speech bandwidth extension by data hiding and phonetic classification.
In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 4, pages 593–596, Honolulu, Hawaii, USA, 2007. [12] D. L. Richards. Telecommunication by Speech, The Transmission Performance of Telephone Networks. London, Butterworths, 1973. [13] L. R. Rabiner and R. W. Schafer. Digital Processing of Speech Signals. Prentice-Hall, New Jersey, 1978. [14] J. L. Flanagan. Speech Analysis Synthesis and Perception. Springer-Verlag, 1965. [15] H. Pulakka, P. Alku, L. Laaksonen, and P. Valve. The effect of highband harmonic structure in the artificial bandwidth expansion of telephone speech. In Proceedings of Interspeech, pages 2497–2500, Antwerp, Belgium, 2007. [16] G. Fant. Acoustic Theory of Speech Production. The Hague: Mouton, 1960. [17] J. Makhoul. Linear prediction: a tutorial review. Proceedings of the IEEE, 63(4):561–580, 1975. [18] T. D. Rossing. The Science of Sound. Addison Wesley, 1990. [19] J. Blauert, editor. Communication Acoustics. Springer-Verlag, 2005. [20] M. Karjalainen. Kommunikaatioakustiikka. Teknillinen korkeakoulu, 1999. [21] J. Blauert. Spatial Hearing: The Psychophysics of Human Sound Localization. The MIT Press, 2001. [22] O. Lodge. The history and development of the telephone. Journal of the Institution of Electrical Engineers, 64(359):1098–1114, 1926. [23] S. Voran. Listener ratings of speech passbands. In Proceedings of the IEEE Workshop on Speech Coding For Telecommunications, pages 81–82, 1997. [24] N. R. French and J. C. Steinberg. Factors governing the intelligibility of speech sounds. The Journal of the Acoustical Society of America, 19(1):90–119, 1947. [25] E. Larsen and R. M. Aarts. Audio Bandwidth Extension - Application of Psychoacoustics, Signal Processing and Loudspeaker Design. Wiley, 2004. [26] Global mobile suppliers association (GSA). Mobile HD voice: Global update report, GSM/3G market/technology update, May 21, 2012. http://www.gsacom.com/cgi/redir.pl5?url=http://www.gsacom.com/downloads/pdf/GSA_Mobile_HD_Voice_report_210512_hdvz.php4, 2012. [accessed 19-August-2012]. [27] R. V. Cox, S. F. de Campos Neto, C. Lamblin, and M. H. Sherif. ITU-T coders for wideband, superwideband, and fullband speech communication. IEEE Communications Magazine, 47(10):106–109, 2009. [28] J. D. Gibson. Speech coding methods, standards, and applications. IEEE Circuits and Systems Magazine, 5(4):30–49, 2005. [29] International Telecommunication Union. ITU-T Recommendation G.114, One-way transmission time, 2007. [30] F. Hammer, P. Reichl, and A. Raake. The well-tempered conversation: interactivity, delay and perceptual VoIP quality. In Proceedings of the IEEE International Conference on Communications (ICC), pages 244–249, Seoul, Korea, 2005. [31] International Telecommunication Union. ITU-T Recommendation G.711, Pulse code modulation (PCM) of voice frequencies, 1972. [32] International Telecommunication Union. ITU-T Recommendation G.726, 40, 32, 24, 16 kbit/s adaptive differential pulse code modulation (ADPCM), 1990. [33] ETSI. Cellular history. http://etsi.org/WebSite/Technologies/Cellularhistory.aspx. [accessed 10-June-2012]. [34] Gartner. Gartner says worldwide mobile connections will reach 5.6 billion in 2011 as mobile data services revenue totals $314.7 billion. http://www.gartner.com/it/page.jsp?id=1759714, 2011. [accessed 10-June-2012]. [35] A. S. Spanias. Speech coding: A tutorial review. Proceedings of the IEEE, 82(10):1541–1582, 1994. [36] A. M. Kondoz.
Digital Speech: Coding for Low Bit Rate Communication Systems. Wiley, 2004. [37] P. Ojala, A. Lakaniemi, H. Lepänaho, and M. Jokimies. The adaptive multirate wideband speech codec: system characteristics, quality advances, and deployment strategies. IEEE Communications Magazine, 44(5):59– 65, 2006. [38] M. Poikselkä, H. Holma, J. Hongisto, J. Kallio, and A. Toskala. Voice over LTE (VoLTE). Wiley, 2012. [39] K. Järvinen, I. Bouazizi, L. Laaksonen, P. Ojala, and A. Rämö. Media coding for the next generation mobile system LTE. Computer Communications, 33(16):1916–1927, 2010. [40] 3GPP TR 26.935. Packet switched (PS) conversational multimedia applications; performance characterisation of default codecs, 2011. [41] A. Raake. Speech Quality of VoIP: Assessment and Prediction. Wiley, 2006. [42] J. Blauert and U. Jekosch. Sound-quality evaluation - a multi-layered problem. Acta Acustica, 83(5):747–753, 1997. [43] M. Guéguin, R. Le Bouquin-Jeannès, V. Gautier-Turbin, G. Faucon, and V. Barriac. On the evaluation of the conversational speech quality in telecommunications. EURASIP Journal on Advances in Signal Processing, 2008, 2008. [44] P. Loizou. Multimedia Analysis, Processing and Communications, chapter Speech quality assessment. Springer Verlag, 2011. [45] S. Möller and A. Raake. Telephone speech quality prediction: Towards network planning and monitoring models for modern network scenarios. Speech Communication, 38(1-2):47–75, 2002. 83 Bibliography [46] T. Letowski. Sound quality assessment: Concepts and criteria. In Proceedings of the 87th Audio Engineering Society (AES) Convention (Paper D-8, Preprint 2825). [47] N. Côté. Integral and Diagnostic Intrusive Prediction of Speech Quality. Springer, 2011. [48] International Telecommunication Union. ITU-T Recommendation P.835, Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm, 2003. [49] International Telecommunication Union. ITU-T Recommendation P.800, Methods for subjective determination of transmission quality, 1996. [50] International Telecommunication Union. ITU-T recommendation P.805, Subjective evaluation of conversational quality, 2007. [51] International Telecommunication Union. ITU–T Recommendation G.107, The E-model, a computational model for use in transmission planning, 2008. [52] International Telecommunication Union. ITU-T Recommendation P.862, Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, 2001. [53] International Telecommunication Union. ITU-T Recommendation P.862.2, Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs, 2005. [54] International Telecommunication Union. ITU-T Recommendation P.863, Perceptual objective listening quality assessment, 2011. [55] G. Fairbanks. Test of phonemic differentiation: The rhyme test. The Journal of Acoustical Society of America, 30(7):596–600, 1958. [56] W. D. Voiers. Evaluating processed speech using the diagnostic rhyme test. Speech Technology, 1:30–39, 1983. [57] A. S. House, C. E. Williams, M. H. L. Hecker, and K. D. Kryter. Articulation-testing methods: Consonantal differentiation with a closed response set. The Journal of Acoustical Society of America, 37(1):158–166, 1965. [58] International Telecommunication Union. ITU-T Recommendation G.729, Coding of speech at 8 kbit/s using conjugate-structure algebraic-codeexcited linear prediction (CS-ACELP), 2007. 
[59] U. Kornagel. Techniques for artificial bandwidth extension of telephone speech. Signal Processing, 86(6):1296–1306, 2006. [60] H. Gustafsson and U. A. Lindgren amd I. Claesson. Low-complexity feature-mapped speech bandwidth extension. IEEE Transactions on Audio, Speech, and Language Processing, 14(2):577–588, 2006. [61] H. Pulakka, U. Remes, S. Yrttiaho, K. J. Palomäki, M. Kurimo, and P. Alku. Low-frequency bandwidth extension of telephone speech using sinusoidal synthesis and gaussian mixture model. In Proceedings of Interspeech, volume 1, pages 1181–1184, Florence, Italy, 2011. 84 Bibliography [62] B. Geiser, P. Jax, P. Vary, H. Taddei, S. Schandl, M. Gartner, C. Guillaumé, and S. Ragot. Bandwidth extension for hierarchical speech and audio coding in ITU-T Rec. G.729.1. IEEE Transactions on Audio, Speech, and Language Processing, 15(8):2496–2509, 2007. [63] M. Nilsson, H. Gustafsson, S. V. Andersen, and B. Kleijn. Gaussian mixture model based mutual information estimation between frequency bands in speech. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 525–528, Orlando, FL, USA, 2002. [64] P. Jax and P. Vary. An upper bound on the quality of artificial bandwidth extension of narrowband speech signals. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 237–240, Orlando, FL, USA, 2002. [65] Y. Agiomyrgiannakis and Y. Stylianou. Combined estimation/coding of highband spectral envelopes for speech spectrum expansion. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 469–472, Montreal, Canada, 2004. [66] A. H. Nour-Eldin, T. Z. Shabestary, and P. Kabal. The effect of memory inclusion on mutual information between speech frequency bands. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 3, pages 53–56, Toulouse, France, 2006. [67] A. H. Nour-Eldin and P. Kabal. Objective analysis of the effect of memory inclusion on bandwidth extension of narrowband speech. In Proceedings of Interspeech, pages 2489–2492, Antwerp, Belgium, 2007. [68] H. Yasukawa. Signal restoration of broad band speech using nonlinear processing. In Proceedings of the European Signal Processing Conference (EUSIPCO), pages 987–990, Trieste, Italy, 1996. [69] T. V. Pham, F. Schaefer, and G. Kubin. A novel implementation of the spectral shaping approach for artificial bandwidth extension. In Proceedings of IEEE International Conference on Communications and Electronics (ICCE), pages 262–267, Nha Trang, Vietnam, 2010. [70] H. Carl and U. Heute. Bandwidth enhancement of narrow-band speech signals. In Proceedings of the European Signal Processing Conference (EUSIPCO), volume 2, pages 1178–1181, Edinburgh, Scotland, 1994. [71] P. Jax and P. Vary. Wideband extension of telephone speech using a hidden markov model. In Proceedings of the IEEE Workshop on Speech Coding, pages 133–135, Delavan, WI, USA, 2000. [72] H. Pulakka and P. Alku. Bandwidth extension of telephone speech using a neural network and a filter bank implementation for highband mel spectrum. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2170–2183, 2011. [73] Y. Nakatoh, M. Tsushima, and T. Norimatsu. Generation of broadband speech from narrowband speech using piecewise linear mapping. 
In Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), volume 3, pages 1643–1646, Rhodes, Greece, 1997. [74] I. Y. Soon, S. N. Koh, C. K. Yeo, and W. H. Ngo. Transformation of narrowband speech into wideband speech with aid of zero crossings rate. Electronics Letters, 38(24):1607–1608, 2002. [75] B. Iser and G. Schmidt. Neural networks versus codebooks in an application for bandwidth extension of speech signals. In Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), pages 565–568, Geneva, Switzerland, 2003. [76] T. Ramabadran and M. Jasiuk. Artificial bandwidth extension of narrowband speech signals via high-band energy estimation. In Proceedings of the European Signal Processing Conference (EUSIPCO), Lausanne, Switzerland, 2008. [77] K.-T. Kim, M.-K. Lee, and H.-G. Kang. Speech bandwidth extension using temporal envelope modeling. IEEE Signal Processing Letters, 15:429–432, 2008. [78] P. Jax and P. Vary. On artificial bandwidth extension of telephone speech. Signal Processing, 83(8):1707–1719, 2003. [79] J. Makhoul and M. Berouti. High-frequency regeneration in speech coding systems. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 4, pages 428–431, 1979. [80] H. Yasukawa. Wideband speech recovery from bandlimited speech in telephone communications. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), volume 4, pages 202–204, 1998. [81] N. Enbom and W. B. Kleijn. Bandwidth expansion of speech based on vector quantization of the mel frequency cepstral coefficients. In Proceedings of the IEEE Workshop on Speech Coding, pages 171–173, Porvoo, Finland, 1999. [82] S. Chennoukh, A. Gerrits, G. Miet, and R. Sluijter. Speech enhancement via frequency bandwidth extension using line spectral frequencies. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 665–668, Salt Lake City, USA, 2001. [83] C. Yagh and E. Erzin. Artificial bandwidth extension of spectral envelope with temporal clustering. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 5096–5099, Prague, Czech Republic, 2011. [84] P. Jax and P. Vary. Enhancement of band-limited speech signals. In Proceedings of Aachen Symposium on Signal Theory (ASST), pages 331–336, Aachen, Germany, 2001. [85] U. Kornagel. Spectral widening of the excitation signal for telephone-band speech enhancement. In Proceedings of International Workshop on Acoustic Echo and Noise Control (IWAENC), pages 215–218, Darmstadt, Germany, 2001. [86] C.-F. Chan and W.-K. Hui. Wideband re-synthesis of narrowband CELP-coded speech using multiband excitation model. In Proceedings of the Fourth International Conference on Spoken Language (ICSLP), volume 1, pages 322–325, Philadelphia, PA, USA, 1996. [87] D. G. Raza and C. F. Chan. Quality enhancement of CELP coded speech by using an MFCC based Gaussian mixture model. In Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), pages 541–544, Geneva, Switzerland, 2003. [88] B. Iser and G. Schmidt. Bandwidth extension of telephony speech. EURASIP Newsletters, 16(2):2–24, 2005. [89] J. Epps. Wideband Extension of Narrowband Speech for Enhancement and Coding.
PhD thesis, School of Electrical Engineering and Telecommunications, The University of New South Wales, 2000. [90] Y. Qian and P. Kabal. Dual-mode wideband speech recovery from narrowband speech. In Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), pages 1433–1436, Geneva, Switzerland, 2003. [91] R. Hu, V. Krishnan, and D. V. Anderson. Speech bandwidth extension by improved codebook mapping towards increased phonetic classification. In Proceedings of Interspeech, pages 1501–1504, Lisbon, Portugal, 2005. [92] T. Unno and A. McCree. A robust narrowband to wideband extension system featuring enhanced codebook mapping. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 805–808, Philadelphia, PA, USA, 2005. [93] F. Itakura. Line spectrum representation of linear predictor coefficients of speech signals. The Journal of the Acoustical Society of America, 57(S1):35, 1975. [94] S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4):357–366, 1980. [95] J. W. Paulus. Variable bitrate wideband speech coding using perceptually motivated thresholds. In Proceedings of the IEEE Workshop on Speech Coding for Telecommunications, pages 35–36, 1995. [96] P. Jax. Enhancement of Bandlimited Speech Signals: Algorithms and Theoretical Bounds. PhD thesis, Rheinisch-Westfälische Technische Hochschule Aachen, 2002. [97] Y. Yoshida and M. Abe. An algorithm to reconstruct wideband speech from narrowband speech based on codebook mapping. In Proceedings of the Third International Conference on Spoken Language Processing (ICSLP), pages 1591–1594, Yokohama, Japan, 1994. [98] C.-F. Chan and W.-K. Hui. Quality enhancement of narrowband CELPcoded speech via wideband harmonic re-synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 1187–1190, Munich, Germany, 1997. 87 Bibliography [99] J. Epps and W. H. Holmes. A new technique for wideband enhancement of coded narrowband speech. In Proceedings of the IEEE Workshop on Speech Coding, pages 174–176, Porvoo, Finland, 1999. [100] Y. Qian and P. Kabal. Wideband speech recovery from narrowband speech using classified codebook mapping. In Proceedings of Australian International Conference on Speech Science and Technology, pages 106–111, Melbourne, Australia, 2002. [101] K.-Y. Park and H. S. Kim. Narrowband to wideband conversion of speech using GMM based transformation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 3, pages 1843–1846, Istanbul, Turkey, 2000. [102] A. H. Nour-Eldin and P. Kabal. Mel-frequency cepstral coefficient-based bandwidth extension of narrowband speech. In Proceedings of Interspeech, pages 53–56, Brisbane, Australia, 2008. [103] A. H. Nour-Eldin and P. Kabal. Combining frontend-based memory with MFCC features for bandwidth extension of narrowband speech. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 4001–4004, Taipei, Taiwan, 2009. [104] A. H. Nour-Eldin and P. Kabal. Memory-based approximation of the gaussian model framework for bandwidth extension of narrowband speech. In Proceedings of Interspeech, volume 1, pages 1185–1188, Florence, Italy, 2011. [105] L. R. Rabiner. 
A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989. [106] M. Hosoki, T. Nagai, and A. Kurematsu. Speech signal band width extension and noise removal using subband HMM. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 245–248, Orlando, FL, USA, 2002. [107] G. Chen and V. Parsa. HMM-based frequency bandwidth extension for speech enhancement using line spectral frequencies. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 709–712, Montreal, Canada, 2004. [108] A. Uncini, F. Gobbi, and F. Piazza. Frequency recovery of narrow-band speech using adaptive spline neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 997–1000, Phoenix, AZ, USA, 1999. [109] D. Zaykovskiy and B. Iser. Comparison of neural networks and linear mapping in an application for bandwidth extension. In 10th International Conference on Speech and Computer (SPECOM), Patras, Greece, 2005. [110] A. Shahina and B. Yegnanarayana. Mapping neural networks for bandwidth extension of narrowband speech. In Proceedings of Interspeech, pages 1435–1438, Pittsburgh, USA, 2006. [111] S. Haykin. Neural Networks - A Comprehensive Foundation. Prentice Hall, 1999. [112] X. Yao. Evolving artificial neural networks. Proceedings of the IEEE, 87(9):1423–1447, 1999. [113] R. Miikkulainen. Neuroevolution. In Encyclopedia of Machine Learning, Springer, New York. http://nn.cs.utexas.edu/?miikkulainen:encyclopedia10-ne, 2010. [accessed 10-June-2012]. [114] 3GPP TS 26.131. Terminal acoustic characteristics for telephony; requirements, 2011. [115] One World Nations Online. Most widely spoken languages in the world. http://www.nationsonline.org/oneworld/most_spoken_languages.htm, 2012. [accessed 10-June-2012]. [116] P. Bauer, M.-A. Jung, Q. Junge, and T. Fingscheidt. On improving speech intelligibility in automotive hands-free systems. In IEEE International Symposium on Consumer Electronics (ISCE), Braunschweig, Germany, 2010. [117] QNX Aviage acoustic processing suite. http://www.qnx.com/products/acoustic/index.html, 2012. [accessed 10-June-2012]. [118] B. Iser and G. Schmidt. Receive side processing for automotive hands-free systems. In Proceedings of the Workshop on Hands-Free Speech Communication and Microphone Arrays (HSCMA), pages 236–239, Trento, Italy, 2008.

Errata

Publication VII: In section V, "The frequency response of a speaker phone microphone in a mobile device differs from the frequency response of high quality headphones" should be: "The frequency response of a speaker phone loudspeaker in a mobile device differs from the frequency response of high quality headphones".