Manish Arora, Suresh Babu P. V. and Vinay M. K.
Emuzed India Private Limited, #839, 7th Main, 2nd Cross, HAL 2nd Stage, Bangalore, India-560008
Phone: +91-80-525222314, Email: arora, pvsuresh, vinaymk@emuzed.com (www.emuzed.com)
Mobile Multimedia Messaging (MMS) promises to
provide a richer and more versatile experience to the user
along with new revenue streams for mobile service
operators. MMS allows a full content range including
images, audio, video and text in any combination. It
delivers a location-independent, total communication
experience to the mobile customer. As shown by the
success of Short Message Service (SMS) in generating
revenue, MMS applications would be the essential
drivers of continuous growth in new services beyond
voice. Current programmable mobile handsets could
suffice as a platform for simple MMS service
introduction if the existing capability of the RISC
processor of the handset is exercised fully. Speech, an
important content type in MMS, has traditionally been
encoded and decoded on DSP processors. This paper
describes the challenges and techniques of implementing
speech codecs on RISC processors. The specific speech
codec implemented was GSM-AMR and the processor
used was ARM9TDMI. The techniques described are
generic and applicable to any speech codec and RISC
processor platform.
The global mobile communications industry is currently
evolving from voice-driven communication to personal
multimedia communication. The mobile data services
industry has seen tremendous growth, and SMS
services have been the first to have a financial impact on
the operator's revenue. It is predicted that messaging
will lead the way to profitability in 3G as well [1]. MMS,
the extension of SMS, requires high-speed networks
for transmitting messages that consist of much more data
than current messages. General Packet Radio Service
(GPRS) creates an ideal platform for mobile data
applications and services. Since MMS is based on the
Wireless Application Protocol (WAP), which is
bearer and network independent, multimedia messages
from a GSM/GPRS network can be
sent to a TDMA or WCDMA network [1].
The evolution of messaging content and user
benefits may take place as shown in figure 1. From the
existing text messaging, the next step would be simple
picture messaging. This would be further enhanced with
the facility to personalize and create content in the form
of graphics and eventually images, with digital camera
and mobile phone interfacing. The picture would be
annotated with text or speech. The final step may be new
multimedia content such as audio and video clips [1].
©2002 IEEE
Figure 1. Messaging Content Evolution with Time
These applications would necessitate improvements in
mobile handheld devices and communication protocols.
Compression of next generation messaging content
would be of prime importance due to bandwidth and
storage constraints. To enable fast transfer, multimedia
content needs to be subjected to suitable
compression stages. These compression algorithms
would be implemented on the target handheld devices. Ideal
candidates for the implementation of these image and
speech compression algorithms for the MMS application
would be DSP processors. But the existing
programmable mobile devices combining personal
digital assistants and phones have limited programmable
processing power, in the form of RISC processors only.
The hardware available is enough for simple applications
such as Internet browsing and word processing, along
with running the operating system and system functions
[2]. A possibility of implementing basic MMS on
existing devices lies in efficiently exploiting the
processing power of the RISC processor. Efficiency
in implementation is critical due to the limited
processing power and the direct influence of the firmware on
the battery life of the device.
In this paper we describe methods for implementing
typical speech codecs on RISC processors for MMS
solutions. Section 2 explains typical RISC processor
hardware design advantages. Speech codecs and issues
with implementing them on RISC architectures are
discussed in section 3. Section 4 suggests techniques for
efficient processor implementations. Section 5 provides
the results and conclusions of our target implementation,
followed by the references.
DSP 2002 - 831
With the Internet and wireless services coming together,
digital handheld communication devices are expanding
rapidly in capability and hence complexity. The features
present in RISC architectures provide distinct
advantages in the design, development and product
capability for this expanding market. The emphasis of
the RISC system design approach is on software while
maintaining simplicity in the hardware, and high
compiler efficiency has been an important RISC
hardware design goal. This makes the software design
process simpler, reducing software development time
considerably for time and power critical applications.
The processor has a single clock and a few very simple
instructions. LOAD and STORE are incorporated as
separate instructions and the rest of the instruction set is
register to register. Instruction set orthogonality is easily
achievable for RISC instruction sets. The disadvantage is
the resulting large code size. The hardware complexity
of such designs is low, so more transistors can be spent
on general-purpose registers. This further improves the
overall compiler efficiency. A variety of silicon-efficient,
low power RISC cores are now available
for use in handheld devices, powering applications such as
consumer entertainment, digital imaging, networking,
security and wireless devices. As an example for our
discussion of RISC processors, we discuss the very
widely used ARM9TDMI.
2.1. ARM9TDMI
ARM9TDMI is a member of the ARM family of 32-bit
general-purpose RISC microprocessors, targeted at
embedded applications where high performance, low die
size and low power are all important. The ARM9TDMI
provides a high instruction throughput and impressive
real-time interrupt response at a small die size and
low cost. The ARM9TDMI supports both the 32-bit ARM
and 16-bit Thumb instruction sets, allowing the user to trade
off between high performance and high code density [3].
The Thumb instruction set contains commonly used
instructions at 16-bit sizes. The Thumb instruction set can be
used for code size critical portions, while complicated
DSP portions can be coded using the 32-bit ARM
instruction set. In contrast to previous ARM processors such as
the ARM7TDMI, which are based on a Von
Neumann architecture, this device has a Harvard
architecture, and a simple bus interface eases connection
to either a cached or an SRAM-based memory system. The
features of the ARM9TDMI processor relevant to the
speech encoding and decoding process are:
Fifteen general-purpose registers, one of which is
used while branching;
Conditional execution of instructions;
Five-stage pipeline and Harvard architecture;
Block data transfer instructions;
Instructions MUL (multiply) and MLA (multiply
and accumulate) with source operand dependent
execution cycle times.
Over the years speech compression algorithms have
evolved to provide good quality speech even at sub 8
kbps rates. Prominent among them are Code Excited
Linear Prediction (CELP) based speech codecs, in which
the residual signal obtained after linear prediction (LP)
analysis of the input speech signal is used to choose the
optimal excitation code vector by employing an
analysis-by-synthesis configuration. With the advent of algebraic
CELP, speech coding has become more computation
intensive, while easing static memory requirements.
These compression methods have been standardized in
the last few years by ITU-T, ETSI and TIA in the form of
various proposals.
Speech compression algorithms, like most
other DSP algorithms, are multiply-accumulate (MAC)
and looping intensive. These algorithms have been instrumental
in the evolution of digital signal processors. Specialized
features such as single cycle MAC
capability, zero overhead looping, specialized addressing
modes, hardware support for fixed-point arithmetic and
parallel address generation units have made DSPs the
most suitable platforms for implementing a wide
variety of DSP based algorithms. Fast interrupt
response, larger memory addressing capabilities and
inherent architectural advantages for efficient compiler
design, on the other hand, have made RISC processors
more apt for control code flow management and operating systems.
The huge market for flexible, low power and low
cost single chip solutions for high volume consumer
applications has prompted many RISC processor
vendors to provide limited DSP functionality [4]. For
example, the ARM9TDMI provides an 8-bit by 32-bit Booth's
multiplier. Also the native data path width of most RISC
processors today is 32 bits. While this extended
precision over 16-bit DSP processors is an advantage for
applications such as audio, it is a problem for
standardized speech codecs. The speech coding
standards target 16-bit DSP processor implementations
and impose strict bit compliance. The software code
provided by the standards bodies for bit exactness
verification is a simulation of 16-bit fixed-point arithmetic.
Optimizations in typical RISC implementations are
targeted at the efficient use of the multiplier along with
other hardware features, looping overhead reductions,
load reductions for slower memory access systems, etc.
These methods are discussed in detail in the next section.
Along with the documentation describing the algorithms
and references, the standards bodies provide ANSI C
code corresponding to a fixed-point fractional arithmetic
implementation of the codec. A set of test vectors
comprising input and output streams is given to verify
the bit exact development of the codec on the target
platform. The ANSI C code is written in a manner that aids
porting to different platforms, and in order to achieve this
the basic mathematical operations of addition,
subtraction, multiplication etc. are implemented as
separate functions. This is done so that these operations
can be rewritten in the most optimized manner for the
specific DSP architecture. Also, since the C language
does not support fixed-point fractional arithmetic, all the
fixed-point operations are functionalized. Since the RISC
architecture is very different from that of a typical DSP,
it becomes necessary to modify the standard code and
generate a working reference code. The modifications
performed while developing the reference code address
issues such as memory load optimization, reducing function call
overhead, fixed-point arithmetic optimization, MAC
optimizations, compiler-specific optimizations and loop
optimization.
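The idea of basic operations written as separate functions can be pictured with a minimal C sketch. The names `sat16`, `add16`, `mult_q31` and `mac_q31` are hypothetical and chosen only for illustration; they imitate the general style of 16-bit saturating fixed-point operators and are not taken from the standard code.

```c
#include <stdint.h>

/* Hypothetical helper: clamp a 32-bit intermediate to the signed 16-bit range. */
static inline int16_t sat16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* 16-bit addition with saturation instead of wrap-around. */
static inline int16_t add16(int16_t a, int16_t b)
{
    return sat16((int32_t)a + (int32_t)b);
}

/* Fractional multiply: Q15 x Q15 -> Q31.
   Only (-32768) * (-32768) overflows after doubling, so it is saturated. */
static inline int32_t mult_q31(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    if (p == (int32_t)0x40000000) return INT32_MAX;
    return p * 2;
}

/* Multiply-accumulate with a saturating 32-bit accumulator. */
static inline int32_t mac_q31(int32_t acc, int16_t a, int16_t b)
{
    int64_t s = (int64_t)acc + (int64_t)mult_q31(a, b);
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}
```

Because each operation is isolated in one small function, a port can replace the body of a single operator with a version tuned for the target without touching the algorithm code that calls it.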
4.1. Memory Load Optimization
One of the major techniques in developing the reference
code is to develop the code at the processor's most
optimized memory load width. This width would
generally be more than 16 bits, the precision at which the
standard code is written. To maintain bit exactness with
the standard, no extra precision is added; the data width
is increased only by the use of extra sign bits. The data is
converted back to the 16-bit standard code precision during
the final write to the output stream. There is a
considerable cycle gain achieved by generating the
reference code with this method, since it reduces the stall
and wait cycles associated with sub-word loads. As an
example, on the ARM9TDMI half-word (16-bit) loads
induce pipeline stalls if the loaded registers are used
immediately. Since we convert the same 16-bit data
into 32 bits without adding any extra precision and do
word length loads, we do not get any stalls. Further, in the
ARM9TDMI multiple load and store instructions are
available but operate only on word data and not on
half-word data. So we get considerable code size advantages
as well. From our implementation studies we found that
the resulting increase in data memory size requirement is
more than compensated by the corresponding reduction
in program code size requirement. Program code size
reduction has advantages that considerably offset
the increase in data memory size due to the data width
increase. Since the program size is substantially larger
than the program cache, a reduction in the program size
reduces program cache misses by more than the
increase in data cache misses caused by the larger data
memory requirement. In our experiments with a
real-time device, while data cache hits remained almost the
same, program cache hits increased substantially.
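The widening step described above can be sketched in C as follows. This is a minimal illustration, assuming the codec keeps its working buffers as 32-bit words; the function names `widen_samples` and `narrow_samples` are hypothetical and do not come from the paper.

```c
#include <stddef.h>
#include <stdint.h>

/* Widen 16-bit samples to 32-bit words by sign extension only.
   No extra precision is introduced: every value stays in the 16-bit range. */
void widen_samples(const int16_t *in, int32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (int32_t)in[i];
}

/* Convert back to 16 bits on the final write to the output stream.
   Assumes the values already lie within the 16-bit range, which holds
   when all intermediate arithmetic was kept bit exact. */
void narrow_samples(const int32_t *in, int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (int16_t)in[i];
}
```

With the data stored this way, inner loops read whole words, at the cost of doubling the size of the affected data buffers.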
4.2. Arithmetic Operation Optimization
As explained earlier, the standard code implements every
fixed-point arithmetic operation by way of separate
functions, to aid porting and profiling. These fixed-point
arithmetic functions are hand assembled for efficient
implementation on the target processor and inlined to
reduce function call overhead. In a fixed-point arithmetic
implementation, saturation and overflow checks are an
important set of operations. The standard code simulates
these operations with additional checks and
comparisons. While fixed-point DSPs have extensive
hardware support for these operations, leading to an almost
zero overhead implementation, on RISC processors
they are cycle consuming operations. Thus
it is a good idea to remove these checks wherever they
are redundant, by using the additional knowledge
available about the data values and their ranges. For example,
shift operations involving a constant shift value can
avoid range compliance checks.
Low overhead instructions for conditional instruction execution based
on status flags can be effectively used on RISC
processors for efficient implementation of these
saturation and overflow checks.
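A small C sketch may make the point about redundant checks concrete. It assumes 16-bit values held in 32-bit words; the helper names `shl_checked` and `shl3_known_range`, and the range of -4096 to 4095, are hypothetical choices for illustration.

```c
#include <stdint.h>

/* General case: saturating left shift of a 16-bit value held in a 32-bit word.
   The shift count is assumed to be in the range 0..15. */
static inline int32_t shl_checked(int32_t x, int shift)
{
    /* Multiplying by a power of two avoids left-shifting a negative value. */
    int32_t r = x * ((int32_t)1 << shift);
    /* Compare-and-select forms like these are natural candidates for
       conditionally executed instructions rather than branches. */
    r = (r > 32767) ? 32767 : r;
    r = (r < -32768) ? -32768 : r;
    return r;
}

/* Special case: constant shift by 3 when the caller guarantees
   -4096 <= x <= 4095. The result always fits in 16 bits, so the
   range checks are redundant and are dropped. */
static inline int32_t shl3_known_range(int32_t x)
{
    return x * 8;
}
```

Both functions return the same value whenever the range guarantee holds, so the substitution does not affect bit exactness.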
The lack of single cycle multiplier/MAC units in RISC
processors makes multiplications and divisions costly
operations relative to their DSP implementations.
Divisions and multiplications by dyadic numbers are
converted into the corresponding shift operations. In cases
of constant non-dyadic multipliers, multistage
combinations involving shift and ADD/SUB operations
can replace multiplication. For example, multiplication by 40
can be split up into two stages, the first stage involving
independent shifts by 5 bits (multiply by 32) and 3 bits
(multiply by 8), followed by a second stage addition.
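The multiplication by 40 mentioned above can be written out as a short C sketch. The function names are hypothetical; the casts to unsigned are an assumption made here so that shifting negative values stays well defined in C on two's complement targets.

```c
#include <stdint.h>

/* Multiply by 40 using two independent shifts and one addition:
   40 * x = 32 * x + 8 * x = (x << 5) + (x << 3).
   The caller is assumed to guarantee that 40 * x fits in 32 bits. */
static inline int32_t mul40(int32_t x)
{
    uint32_t ux = (uint32_t)x;            /* shift in unsigned arithmetic */
    return (int32_t)((ux << 5) + (ux << 3));
}

/* Division by a dyadic number (here 8) as a shift, for non-negative inputs. */
static inline uint32_t div8_nonneg(uint32_t x)
{
    return x >> 3;
}
```

The two shifts do not depend on each other, so only the final addition has to wait for both results.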
The compiler performance for RISC processors is
substantially better than that for DSP processors. The
performance of our working reference code can still be
improved by following some simple steps. By the
judicious use of local variables, compiler performance
can be improved considerably. We can try reducing the
number of local variables used in specific functions. The
code performance improves significantly since the
compiler is now able to dedicate registers to variables.
Intensive loops from critical functions can be replaced
by hand-assembled functions; the code performance
improves, although at the cost of additional function call
overhead. Pointers to local variables used in critical
loops should be avoided, as they necessitate the use of a
memory location for the variable instead of a register.
Limiting the number of variables passed to functions can
reduce function-calling overheads. For functions with a
large number of arguments, suitably encapsulating the
variables into a single data structure and passing its
address reduces call overhead.
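The argument-passing suggestion can be illustrated with a minimal C sketch. The structure `EnergyArgs` and the function `strided_energy` are hypothetical examples, not code from the codec. On ARM, the procedure call standard passes only the first four word-sized arguments in registers, so a function taking five separate parameters would typically spill one to the stack, whereas a single pointer always fits in a register.

```c
#include <stdint.h>

/* Hypothetical parameter block replacing five separate arguments. */
typedef struct {
    const int32_t *sig;   /* input samples                  */
    int32_t start;        /* first index to process         */
    int32_t end;          /* one past the last index        */
    int32_t step;         /* stride between samples         */
    int32_t init;         /* initial accumulator value      */
} EnergyArgs;

/* Sum of squares over a strided range; only one pointer is passed. */
int32_t strided_energy(const EnergyArgs *a)
{
    /* Copy fields into a few locals so the compiler can keep them in registers. */
    const int32_t *sig = a->sig;
    int32_t end = a->end;
    int32_t step = a->step;
    int32_t acc = a->init;

    for (int32_t i = a->start; i < end; i += step)
        acc += sig[i] * sig[i];
    return acc;
}
```

The fields are read once before the loop, in line with the advice to keep the number of live local variables small enough for register allocation.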
4.3. Loop Optimization
After the development of the reference code, critical
portions of the algorithm are hand assembled. Profile
results of the C code provide a good estimate of the
relative importance of functions. RISC assembly coding
is relatively easy compared to DSP assembly coding,
but specialized techniques have to be applied to
obtain the best performance from the hand-assembled code.
Since DSP codes are looping intensive, DSP
processors generally provide zero overhead looping (hardware
loops), where the loop count update and check are
performed in hardware without any cycle consumption.
Looping overhead in software loops is considerable and
is directly proportional to the total number of iterations in the loop.
Consider the example of the correlation calculation code
in figure 2. Reduction in the looping overhead of the
internal loop would give a significant cycle count
reduction. This can be achieved by dividing the internal
loop count by a suitable factor (2-5) and performing that
many iterations' worth of instructions inside the loop,
effectively reducing the number of jumps and loop end checks.
Figure 2. Correlation Calculation C code.
Another powerful technique for improving performance
is reducing the number of loads in loops. This technique
is applicable to many DSP algorithms such
as filtering and correlation calculations. Consider the
example of the correlation calculation C code in figure 2
again. For every outer loop iteration we load the
IScaledSig[j] data once again. If the outer loop is
unrolled, then along with the reduction in looping
overhead of the outer loop we can also reuse the first
loaded value of IScaledSig[j] for the rest of the unrolled
portions. When outer loop unrolling from this example is
combined with inner loop unrolling we can achieve
significant reductions in loads. With values of 160 for
IFrameLen and 100 for Lag_Max, the loads are reduced
from 32320 to approximately 11000. This load
reduction assumes more significance since in many
systems, because of the usage of slower memories to reduce
cost, there are high latencies associated with memory.
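Since the code of the figures is not reproduced in this text, the following C sketch illustrates the two techniques on a generic correlation loop. It is not the authors' code: the names `corr_ref` and `corr_unrolled` are hypothetical, the unroll factor is 2 for both loops to keep the example short, saturation is omitted, and `sig` is assumed to point into a buffer with at least `lag_max` samples of history before index 0.

```c
#include <stdint.h>

/* Straightforward form: two loads per multiply-accumulate. */
void corr_ref(const int32_t *sig, int len, int lag_max, int32_t *corr)
{
    for (int i = 0; i <= lag_max; i++) {
        int32_t acc = 0;
        for (int j = 0; j < len; j++)
            acc += sig[j] * sig[j - i];
        corr[i] = acc;
    }
}

/* Outer and inner loops unrolled by 2.
   Assumes len is even and (lag_max + 1) is even. */
void corr_unrolled(const int32_t *sig, int len, int lag_max, int32_t *corr)
{
    for (int i = 0; i <= lag_max; i += 2) {
        int32_t acc0 = 0;   /* accumulator for lag i     */
        int32_t acc1 = 0;   /* accumulator for lag i + 1 */
        for (int j = 0; j < len; j += 2) {
            /* First operand: loaded once, shared by both lags. */
            int32_t a0 = sig[j];
            int32_t a1 = sig[j + 1];
            /* Second operand: sig[j - i] serves lag i at j and lag i + 1 at j + 1. */
            int32_t b0 = sig[j - i - 1];
            int32_t b1 = sig[j - i];
            int32_t b2 = sig[j - i + 1];
            acc0 += a0 * b1 + a1 * b2;
            acc1 += a0 * b0 + a1 * b1;
        }
        corr[i] = acc0;
        corr[i + 1] = acc1;
    }
}
```

In this sketch each inner block performs four multiply-accumulates with five loads instead of eight, and the loop end check runs a quarter as often as in the straightforward form.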
Reference code development and loop optimization
techniques suggested in Section 4 were applied to the
GSM-AMR [5] speech codec standard code. The
development was carried out on the widely popular
ARM9TDMI RISC processor. GSM-AMR with its
variety of bit rates is the ideal candidate for voice
content in Multimedia Messaging. Table 1 lists the cycle
counts achieved for various modes of operation of the
GSM-AMR codec in our development. The cycles
quoted assume zero wait states in memory access.
The program memory and constant table memory size was
about 100k bytes and the maximum stack usage was less
than 8k bytes. The codec was verified for bit exactness
with the standard reference streams. Table 1 also lists
comparisons with Emuzed's Decoder implementations
Table 1. Cycle counts in Mega Cycles for various
modes of GSM-AMR.
In this paper, we presented methodologies for real-time
speech codec implementation on RISC processor
architectures. The specific codec implemented was
GSM-AMR on the ARM9TDMI processor. These
methods are applicable to any speech codec and
RISC processor platform.
[1] Nokia Networks, "Nokia Multimedia Messaging,"
http://nds1.nokia.com/press/background/pdf/feb01
Nokia Mobile Phones, pp. 1-8, Feb 2001.
[2] O. Gunasekara, "Developing a digital cellular phone
using a 32-bit Microcontroller," White Paper,
ment, ARM Ltd, pp. 1-7.
[3] ARM Ltd., ARM9TDMI Technical Reference
Manual, ARM Ltd, ARM DDI 0145A, November 1998.
[4] Berkeley Design Technology Inc., Inside the ARM7,
ARM9 and ARM9E, http://www.bdti.com/products
[5] Universal Mobile Telecommunication System
(UMTS); Mandatory Speech Codec speech processing
functions, "AMR Speech Codec - Transcoding
Functions," 3GPP TS 26.090, Version 4.0.0, Release 4,
March 2001.
Figure 3. Correlation Calculation Modified C code.
In figure 3 we demonstrate the unrolling of the outer
loop by 4 and the inner loop by 5, and the subsequent data
rearranging. We are able to reduce loads further since,
along with the reuse of IScaledSig[j] values, the values of the
second multiplication operand are also the same for
various values of i and j. As an example, IScaledSig[j-i]
is the same as IScaledSig[(j+1)-(i+1)], IScaledSig[(j+2)-(i+2)]
and IScaledSig[(j+3)-(i+3)]. Comparing the total
number of loads from the two examples, the
(1+1)*IFrameLen*(Lag_Max+1) loads of the original code
are reduced to (5+8)*(Lag_Max/4+1)*(IFrameLen/5).
Assuming values of 160 for IFrameLen and 100 for Lag_Max,
these expressions give 32320 and 10816 loads respectively.
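The load counts can be checked with a few lines of C. This is a sketch under one assumption: the per-block load count of the unrolled code is taken as 5 + 8 (five values of the first operand and eight distinct values of the second for a block of 4 lags by 5 samples), a reading that is consistent with the figure of approximately 11000 quoted in the paper. The function names are hypothetical.

```c
/* Loads in the straightforward double loop: two operand loads per
   multiply-accumulate, frame_len products for each of lag_max + 1 lags. */
long loads_original(long frame_len, long lag_max)
{
    return (1 + 1) * frame_len * (lag_max + 1);
}

/* Loads after unrolling the outer loop by 4 and the inner loop by 5,
   assuming 5 + 8 distinct loads per unrolled block (see lead-in).
   Integer division mirrors the loop trip counts. */
long loads_unrolled(long frame_len, long lag_max)
{
    return (5 + 8) * (lag_max / 4 + 1) * (frame_len / 5);
}
```

For a frame length of 160 and a maximum lag of 100 these functions return 32320 and 10816, roughly a threefold reduction.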