Pers Ubiquit Comput (2010) 14:723–735
DOI 10.1007/s00779-010-0287-7
ORIGINAL ARTICLE
GeeAir: a universal multimodal remote control device for home
appliances
Gang Pan • Jiahui Wu • Daqing Zhang •
Zhaohui Wu • Yingchun Yang • Shijian Li
Received: 1 June 2009 / Accepted: 22 October 2009 / Published online: 10 March 2010
© Springer-Verlag London Limited 2010
Abstract In this paper, we present a handheld device
called GeeAir for remotely controlling home appliances via
a mixed modality of speech, gesture, joystick, button, and
light. This solution is superior to existing universal remote controls in that it can be used by users with physical and vision impairments in a natural manner. By
combining diverse interaction techniques in a single
device, the GeeAir enables different user groups to control
home appliances effectively, satisfying even the unmet
needs of physically and vision-impaired users while
maintaining high usability and reliability. The experiments
demonstrate that the GeeAir prototype achieves prominent
performance through standardizing a small set of verbal
and gesture commands and introducing the feedback
mechanisms.
G. Pan (✉) · J. Wu · Z. Wu · Y. Yang · S. Li (✉)
Department of Computer Science, Zhejiang University,
Zhejiang, China
e-mail: [email protected]
J. Wu
e-mail: [email protected]
Z. Wu
e-mail: [email protected]
Y. Yang
e-mail: [email protected]
S. Li
e-mail: shiji[email protected]
D. Zhang
Handicom Lab, Institut TELECOM SudParis, Evry, France
e-mail: [email protected]
Keywords Universal remote controller · Gesture recognition · Speech recognition · Smart home
1 Introduction
Nowadays, it is almost impossible for home inhabitants to go a day without interacting with home appliances.
Although remote controls for "home appliances" such as the TV, DVD, windows, and lights serve ordinary people well with acceptable physical and emotional comfort, they can provide even more for the dignity, security, and well-being of elderly or disabled people [1]. One can imagine a situation where a person has lost some of his/her physical
dexterity or mobility. In the absence of suitable controls,
he/she would need a caregiver to assist with the operation
of home appliances, with the attendant expense and loss of
independence and privacy. But with adequate assistance,
this person might be able to live independently at his/her
home.
The current home appliances are often equipped with
remote controllers operating via infrared (IR) light signals.
Each household is likely to own several remote controllers,
which are often incompatible with each other and have
different layouts. In order to reduce the number of remote
controls, universal remote controllers (URCs) were introduced to merge the functions of individual controllers into
one device [2–5]. A URC learns IR command sets from
each appliance and operates the appliance selected by a
user. There are two fundamental steps involved in the
control procedure of a URC: target object selection and
command issuing. To select a target object for operation, a
user might press a button, turn a rotary wheel, or touch an
icon depending on how the panel of the URC is designed.
To issue a command, a user needs to point the controller to
the target appliance and press a specific button on the
controller. Subsequently, the controller emits the infrared
signal to the selected appliance for the specified operation.
Although URCs combine the functions of remote controllers into one device, elderly and disabled home users
may still have difficulties in using a URC due to a number
of reasons: First, a URC has too many buttons that need to
be remembered, and several button presses may be needed
to achieve a simple function. Second, the buttons on a URC
may be too small for the elderly, physically disabled and
vision-impaired people to use. Finally, button operation is
just one modality to interact with the home appliances,
which may not be the most natural and efficient means for
human machine interaction.
Speech and gesture are two natural ways that people
interact with each other. Much research has been done to
use speech, gesture, or eye-gaze to control home appliances. However, there is limited success reported in the
literature on the deployment of these modalities due to the
constraint of each single modality. Controlling through a
spoken language or oral command is indeed straightforward for expressing intentions, but the single modality of speech has the following limitations in real implementations: First, the accurate extraction and recognition of control commands from daily continuous speech is still difficult due to the ambiguities of natural languages, especially in noisy environments. Second, speech is not instant; some commands need complex phrases or sentences, which may take a long time to utter and process.
Using the single modality of gesture to control home
appliances has also been explored. Since the computer
vision-based gesture and eye-gaze control is highly
dependent on the lighting condition and camera facing
angle, it turns out to be rather difficult to accurately recognize gestures under poor lighting conditions using a camera-based system. In addition, it is also uncomfortable and inconvenient if the user is required to face the camera directly to complete a gesture. Different from the vision-based gesture recognition approach, accelerometer-based gesture interaction is an emerging technique that exploits the acceleration data of hand motion for recognition and control. No camera is required, only a wearable or portable accelerometer-equipped device used in daily life, such as a watch, a smartphone, or an MP3 player. These wireless-enabled portable/wearable devices provide new possibilities for interacting with a wide range of home appliances such as doors, window curtains, and TVs.
In this paper, we present a universal multimodal remote
control device which unifies several interaction modalities
such as speech, gesture, button, joystick, and light, so that
home inhabitants ranging from common users to elderly,
physically disabled, and vision-impaired people are all able
to interact with the home appliances in the way they feel
comfortable. Specifically, we develop a universal multimodal remote controller, called GeeAir, which not only
provides comfort and convenience for common users in
controlling home appliances, but also meets the special
needs of physically and vision-impaired people in operating the home appliances to live independently and enjoy a
better quality of life.
The paper is organized as follows. First, the related work
on universal remote controllers and multimodal control
systems is summarized in Sect. 2. Then an overview of the
GeeAir system architecture is presented in Sect. 3. In Sect.
4, the key techniques to select the desired target appliance
for operation are described, followed by the introduction of
feedback mechanisms ensuring the reliable confirmation.
Section 5 proposes a standard set of hand gestures for
operating different home appliances and a novel algorithm
for the accelerometer-based gesture recognition. Section 6
reports the implementation details and the experimental
results of the speech/gesture recognition algorithms compared to other existing algorithms. An initial evaluation of
the GeeAir prototype with 10 users is also given in this
section. Finally, we provide our conclusions for the design
and test of GeeAir and highlight some future research
directions in Sect. 7.
2 Related work
In the consumer electronics market, several universal
remote control products can be found in the home electronics stores. These products can be roughly categorized
into two groups, according to how the target appliance is
selected: button-based URCs and screen-based URCs. The
former group allocates a few buttons in the control panel of
the URC for appliance selection, where one button corresponds to one appliance. For example, Philips' 4-in-1
URC has four buttons reserved in the panel to control TV/
VCR/DVD/SAT, respectively. Users select one of the four
appliances by pressing the corresponding button [2]. Since
the number of buttons in a URC control panel is fixed, the
extensibility of the button-based URCs is limited. The
screen-based URCs overcome this limitation by putting a
built-in mini-screen and a navigation button in the control
panel of URCs. When users press the navigation button, the mini-screen shows the home appliances one after another. When the target appliance appears on the screen, the user completes the device selection by releasing the button [3–5]. Apparently, both kinds of URCs only support
button-pressing as the single input modality, thus people
with limited motor skills, finger dexterity, or weak vision
might not be able to use these remote controls.
In parallel to the efforts of developing universal remote
controllers by consumer electronics manufacturers, there
has been a lot of research on universal GUI to enable
mobile devices for home appliance control. Different
approaches have been proposed to generate the universal
graphical user interface in various mobile platforms [6, 7].
All those solutions assume that users can navigate the GUI
on the tiny screen of a mobile device with a pen or button.
Thus, they support only one single input modality and
consequently cannot meet the needs of elders and those
with certain physical or vision impairment.
Compared to the single modality solutions, multimodal
control systems combine the strengths of multiple modalities, and thus increase the applicability and usability of
human–machine interaction. To meet the different
requirements of varied users and applications, various
combinations of input and output modalities have been
explored in previous projects. For example, the seminal
work by Bolt [8] created a ‘‘Put-That-There’’ system where
people can use pointing gesture to select an object from a
virtual diagram of a room which is shown in a large-screen
display and subsequently use speech to operate on the
selected object. The EU HOME-AOM project [9, 10]
applied the mixed modality of speech, gesture, and GUI for
the home appliance control for disabled people, in which
speech and gesture were used to assist in the navigation of
GUI commands. GWindows [11] operated Microsoft Windows applications by using speech to move/close/minimize/maximize/scroll windows and motion gestures to determine the movement distance. Krum et al. [12] implemented a system that helps users navigate a whole-earth 3D visualization environment at a distance from the display. It employs the Gesture Pendant [13] to track simple hand motions and uses speech for navigation commands.
Different from those projects, our work intends to provide a
single, multimodal control device for a wider range of
home users, including the elders and those with physical or
vision impairment besides ordinary users. Our solution
supports a mixed modality of speech, gesture, button,
joystick and light as input and output, adapting to different
needs and interaction preferences of various user groups. In
addition, we use an accelerometer-based gesture recognition approach instead of the camera-based one used previously, which allows users to move freely in a ubiquitous
home environment and control the home appliance in any
lighting condition.
The closest research to our work is by Kela et al. [14]
who used several modalities to interact with a design studio
environment. The modalities explored include speech input
and output, gesture input, RFID-tag, a laser-tracked pen
and a mobile device with touch screen. Our work differs
from theirs in the following aspects:
(1) While Kela et al.'s work uses diverse modalities in a studio environment, they deploy multiple devices to control multiple applications, whereas we focus on building a handy, single multimodal device for controlling multiple home appliances.
(2) Kela et al.'s work takes the design studio as the application environment, the designers as the user group, and convenience and comfort as the design goal. Instead, our research aims at a different, actually larger, user group: we provide not only ordinary home inhabitants with convenience and comfort, but also elders and those with physical and vision impairments. For example, we provide the joystick as an input modality, which is very useful for people with hand disabilities.
(3) In order to ensure the reliability and robustness of multimodal remote controllers for elders and disabled people, we introduce voice and light as feedback, so that the desired control object can be reliably identified even if speech recognition is not 100% accurate. In our GeeAir solution, users are allowed to use speech or the joystick to select a target appliance for operation and to use voice and light to get feedback. Such a solution can satisfy the needs of user groups with impairments in speech, hearing, vision, and hand use.
(4) Although we also use an accelerometer-based approach for gesture control as Kela et al. did, we developed a novel and very different algorithm [15] which is more accurate than the algorithm used in Ref. [14]. While they adopted an HMM (hidden Markov model)-based approach for gesture recognition and processed the acceleration data in the time domain without conducting feature extraction, we process the data in the frequency domain with feature extraction to reduce the noise and variation of the gesture data, thus significantly improving the recognition performance.
3 GeeAir: an overview
The design goal of GeeAir is to be a single universal remote control that serves not only common users but also physically disabled and vision-impaired people.
In the home environment as illustrated in Fig. 1, GeeAir
takes the inputs from the users to select a target appliance
first and then recognizes the predefined hand gesture of
users to control the selected target appliance. As described
before, the mixed modalities of speech, joystick, light, and
button are used for selecting a desired target appliance. In
order to avoid any potential error during the selection, two
feedback mechanisms are introduced in GeeAir design:
lighting feedback and voice echo.
Fig. 1 Illustration of the
GeeAir for remote control of
home appliances
The look and feel of the GeeAir prototype is shown in Fig. 2, which borrows its design from the Nintendo Nunchuk.
The key components of GeeAir and their functionalities are
described as follows:
(1) A three-axis built-in accelerometer: to capture users' 3-D hand gesture signals.
(2) An eight-orientation joystick: to select a target appliance efficiently.
(3) A built-in microphone: to acquire users' speech commands.
(4) A speaker: to provide users with voice feedback and reminders.
(5) Buttons A and B: to label the beginning and end of speech and gesture commands. These two buttons are designed with different sizes and shapes to help users differentiate them by touch.
(6) A built-in digital signal processing unit: to handle the computation involved in processing the multimodal inputs and outputs.
(7) A built-in communication unit: to send and receive wireless signals.

Fig. 2 Conceptual illustration of the GeeAir's components for multimodal control. a A three-axis accelerometer, joystick, microphone, speaker, and two buttons are built into GeeAir; b the two buttons (Button A and Button B) in the front view of GeeAir
The workflow of using GeeAir consists of three main
stages: appliance selection, feedback and confirmation, and
operation command issuing, as shown in Fig. 3. At any
moment, GeeAir has a current appliance for operation. The
current appliance is indicated by the light signal or voice
reminder. If a user intends to control another appliance
rather than the current appliance, he/she needs to select the
desired one via joystick or speaking the target appliance
name. If speech is used, GeeAir will obtain the name of the
target appliance with speech recognition. The feedback for
appliance selection has two options: light signal (a controllable light attached to each appliance) and voice echo,
which help users correct occasional errors of speech recognition of the target appliance name. If the current
appliance is exactly the one that the user wants to operate,
the user can wave the GeeAir in the air for the follow-up
operations. Then the gesture will be recognized by GeeAir
and the corresponding command will be issued to the current appliance wirelessly.

Fig. 3 Workflow of GeeAir: select a target appliance (rotate the joystick or speak the appliance's name), receive feedback (signal light or voice echo), and, once the right appliance is selected, operate it with gesture commands

4 Multimodal selection of a target appliance

4.1 Selecting via speech commands

Speech is one of the most natural ways for interaction between humans and machines. However, for home appliance control, it is still a great challenge to robustly extract and recognize control commands in real-life environments using user-independent, large-vocabulary continuous speech recognition technology. In contrast, small-vocabulary recognition of isolated words is quite reliable and accurate, as verified by many successful practical applications.

GeeAir provides the option of selecting a target appliance via speech commands. GeeAir records the user's utterance through the equipped microphone and then recognizes the appliance name. In this case, the vocabulary to be recognized is small, because the number of home appliances is limited and their names are relatively fixed. In order to avoid segmenting the appliance name from a natural utterance, users are asked to press Button A on GeeAir before speaking the appliance name for object selection, and to release the button after speaking it.

For isolated word recognition, the commonly used techniques include VQ (Vector Quantization), DTW (Dynamic Time Warping), and HMM (Hidden Markov Model) [16, 17]. For GeeAir, we build an isolated word recognition system based on a continuous density hidden Markov model (CDHMM) [18]. The whole recognition process consists of the following steps:

(1) Defining the lexicon: recording the words to be recognized by the system. Each word is recorded several times by each participant.
(2) Feature extraction: the MFCC (Mel Frequency Cepstral Coefficient) feature vectors [19] are computed, together with their first derivatives.
(3) Modeling words: for each word in the lexicon, a left-to-right CDHMM is built with a number of states. Each state is characterized by a Gaussian mixture model (GMM).
(4) Training the models: the parameters of the GMM distributions and the state transition probabilities within the CDHMMs are estimated using the Baum-Welch algorithm [17].
(5) Recognition of a word: first, we compute the observations (feature vectors) of the word; then the probability of these observations being generated by each word's CDHMM is computed using the Viterbi algorithm. The word is recognized as the one whose model yields the highest probability.

4.2 Selecting via joystick

The second modality GeeAir provides to select a target appliance is the built-in joystick. The joystick is a traditional input device for machine control in trucks, CT scanners, and video games. It outperforms buttons in navigation due to its continuity, fast reaction, and the near absence of relative movement between the hand and the stick during control. Thus, the joystick is a good choice for selecting objects arranged around the user in physical space.

The operation principle of the joystick is illustrated in Fig. 4. The accessible area is octagonal. Two states are defined for joystick operation: inactive and active. The inactive state indicates that the joystick is not pushed and stays in the middle of the octagon; the active state indicates that the joystick is pushed to the edge of the octagon at some angle. The eight valid joystick positions are north, northeast, east, southeast, south, southwest, west, and northwest. Each position occupies 45 degrees.
Fig. 4 Octagonal accessible area of the joystick. Each position covers 45 degrees. The joystick can be rotated either clockwise or counter-clockwise to change the position
A user can move the joystick along the octagon to select appliances in the physical space. Intuitively, an octagonal joystick could be statically matched to eight appliances. However, to select the target appliance from a different number of appliances in each household, GeeAir exploits a rule of dynamic and relative association between positions and appliances. A valid position is not necessarily associated with a fixed device. Instead, when a user intends to select an appliance, the initial position to which he/she first pushes the joystick is dynamically associated with the currently selected appliance. As the user rotates the joystick to a neighboring position, the current appliance also shifts to its neighboring appliance. Whether the left or the right nearest appliance is selected depends on the user's rotating direction, i.e. counter-clockwise or clockwise. The dynamic association ensures flexibility when the number of appliances varies; thus, any number of appliances can be easily navigated using the joystick.
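As a concrete illustration, the relative-association rule can be sketched in a few lines of Python; this is a hypothetical model of the behavior described above (class and method names are ours), not the GeeAir firmware:

```python
class JoystickSelector:
    """Relative (dynamic) association of joystick positions to appliances.

    The first push anchors the currently selected appliance to whatever
    position the user reached; each one-step rotation then shifts the
    selection to the neighboring appliance, so any number of appliances
    can be navigated with only eight physical positions.
    """

    def __init__(self, appliances, current=0):
        self.appliances = appliances  # e.g. ["TV", "DVD", "Radio", ...]
        self.current = current        # index of the currently selected appliance

    def rotate(self, steps):
        """steps > 0: clockwise; steps < 0: counter-clockwise."""
        self.current = (self.current + steps) % len(self.appliances)
        return self.appliances[self.current]


sel = JoystickSelector(["TV", "DVD", "Radio", "Lamp", "Curtain"])
sel.rotate(+1)   # one position clockwise -> "DVD"
sel.rotate(-2)   # two positions counter-clockwise -> "Curtain"
```

Because the mapping is relative rather than fixed, the same eight physical positions navigate five appliances here just as easily as twelve.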
4.3 Feedback mechanism
GeeAir has two kinds of feedback mechanisms available for confirmation purposes: voice echo and signal light. GeeAir has a built-in mini-speaker, which can replay the name of the appliance when the appliance is selected by either speech or joystick. The voice echo informs the user whether the object recognized by the system is the one he/she intends to select. If a controllable LED light is attached to each appliance, the lights can be used as feedback, i.e. the red LED light of the selected appliance is turned on for user confirmation while the other lights are kept off.
For joystick-based appliance selection, the light feedback occurs immediately as the joystick changes position; that is, when the joystick moves from one position to another, the light signal also shifts from one appliance to the next. This instant lighting during joystick rotation is very helpful to users because of the quick response of joystick operations. However, the voice echo cannot occur for every covered position if the joystick rotates too fast, because there is not enough time to play it. For this reason, GeeAir sets a movement speed limit of one position per second for the voice echo. If the joystick stays in a position for less than 1 second, the voice echo of the appliance dynamically associated with that position is suppressed. Any voice echo can be interrupted by rotating the joystick to the next position when the user knows that the current appliance is not the desired one, which helps speed up the selection process.
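The one-position-per-second suppression rule can be sketched as a small scheduler; the class and callback names are illustrative assumptions, with an injectable clock so the timing can be simulated:

```python
import time

ECHO_DELAY = 1.0  # seconds a position must be held before its name is spoken

class VoiceEcho:
    """Suppress voice echo for positions the joystick merely passes through.

    The echo for a newly reached position is scheduled rather than played
    immediately; if the joystick moves on within ECHO_DELAY, the pending
    echo is replaced and never spoken.
    """

    def __init__(self, speak, now=time.monotonic):
        self.speak = speak    # callback that plays an appliance name
        self.now = now        # injectable clock, useful for testing
        self.pending = None   # (appliance, deadline) or None

    def on_position_change(self, appliance):
        # Moving to a new position cancels any echo not yet played.
        self.pending = (appliance, self.now() + ECHO_DELAY)

    def tick(self):
        # Called periodically; plays the echo once the hold time has elapsed.
        if self.pending and self.now() >= self.pending[1]:
            self.speak(self.pending[0])
            self.pending = None
```

Driving `tick()` from the main loop keeps the echo logic independent of how fast the joystick is rotated.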
With the feedback mechanisms, if the user finds that the recognized object is not the desired one, he/she can correct it immediately by repeating the appliance selection. Thus, issuing a command to a wrong appliance can be avoided. Either of the two feedback mechanisms can be combined with either of the two selection schemes introduced previously, i.e., four combinations are available: speech-voice, speech-light, joystick-voice, and joystick-light.

Both feedback modalities, voice and light, are suitable for motor-impaired people; they also free users from reading on-screen prompts. The voice-based feedback is suitable for anyone with normal hearing. Although the signal light requires the user's vision, recognizing the binary states of a light, ON and OFF, is less demanding than reading the semantic information in text or pictures on a screen.
5 Operating an appliance via gesture
After the target appliance is selected, GeeAir uses gesture commands to operate it. Gestures performed with GeeAir are recognized based on acceleration data acquired by the built-in three-axis accelerometer [15]. Compared to camera-based gesture recognition techniques [20], accelerometer-based gesture recognition does not rely on lighting conditions or camera facing angle, and does not require any deployment of devices in the environment. Similar to issuing speech commands, users begin a gesture by pressing Button B and end it by releasing the button, avoiding the accuracy degradation caused by gesture segmentation.
5.1 Gesture command definition
In order to enable effective gesture-based interaction, several requirements must be met when designing a set of gesture commands for home appliances: (1) the semantic connection between gestures and commands should be natural, so that the meaning of a gesture is easy for users to learn and remember; (2) gestures should be simple and terse, avoiding those that require high precision over a long period of time; moreover, they should be quick to perform and repeat, without causing fatigue over time; (3) the gesture commands for different appliances should be consistent, i.e., similar operations of different appliances
Table 1 Definition of gesture commands for appliances

Appliance       | Forward–backward | Up; down             | Left; right                 | Double-left; double-right | V; inverted-V
Television      | ON/OFF           | Vol. up; Vol. down   | Prev. channel; Next channel | –                         | –
DVD             | ON/OFF           | Vol. up; Vol. down   | Prev. track; Next track     | F Forward; F Backward     | Play/pause; Stop
Radio           | ON/OFF           | Vol. up; Vol. down   | Prev. channel; Next channel | –                         | –
Speaker         | ON/OFF           | Vol. up; Vol. down   | –                           | –                         | –
Air conditioner | ON/OFF           | Temp. up; Temp. down | –                           | –                         | –
Lamp            | ON/OFF           | Brtn. up; Brtn. down | –                           | –                         | –
Curtain         | Open/Close       | Curt. up; Curt. down | –                           | –                         | –

Vol, 'Volume'; F Forward, 'Fast Forward'; F Backward, 'Fast Backward'; Temp, 'Temperature'; Brtn, 'Brightness'; Curt, 'Curtain'
should be defined as the same gesture to reduce the size of the gesture vocabulary that users have to learn.
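In software, Table 1 amounts to a lookup from (appliance, gesture) pairs to commands. A sketch in Python, transcribing an excerpt of the table (the dictionary layout and function name are ours, not GeeAir's):

```python
# (appliance, gesture) -> command, transcribed from part of Table 1.
# The consistency requirement means a gesture keeps the same meaning
# across appliances wherever the operation is similar.
GESTURE_COMMANDS = {
    ("Television", "forward-backward"): "ON/OFF",
    ("Television", "up"): "Vol. up",
    ("Television", "down"): "Vol. down",
    ("Television", "left"): "Prev. channel",
    ("Television", "right"): "Next channel",
    ("DVD", "forward-backward"): "ON/OFF",
    ("DVD", "up"): "Vol. up",
    ("DVD", "down"): "Vol. down",
    ("DVD", "left"): "Prev. track",
    ("DVD", "right"): "Next track",
    ("DVD", "double-left"): "Fast Backward",
    ("DVD", "double-right"): "Fast Forward",
    ("DVD", "V"): "Play/pause",
    ("DVD", "inverted-V"): "Stop",
    ("Air conditioner", "up"): "Temp. up",
    ("Air conditioner", "down"): "Temp. down",
    ("Curtain", "forward-backward"): "Open/Close",
    ("Curtain", "up"): "Curt. up",
    ("Curtain", "down"): "Curt. down",
}

def command_for(appliance, gesture):
    """Look up the command that Table 1 assigns to a gesture on an appliance;
    returns None when the gesture is undefined for that appliance."""
    return GESTURE_COMMANDS.get((appliance, gesture))
```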
Usually, two different approaches are employed in gesture command definition: user-dependent and user-independent. Previous work focuses more on user-dependent gesture recognition [21–23], where each user is required to perform a couple of gestures as training/template samples before using the system. In this case, users are requested to personalize a remote controller by mapping each operation to a gesture they find suitable and comfortable. However, the training process is still a burden for users, although some work [23, 24] has been done on optimizing recognition algorithms to reduce the size of the training sample set. GeeAir aims at user-independent gesture recognition and control: different users share a common set of gesture commands and do not need to train GeeAir person by person.
In this paper, we define a nine-gesture vocabulary to
control the frequently used functions of seven categories of
home appliances, as listed in Table 1. The gesture of
Forward–Backward is performed in the X–Y plane, and the
other eight gestures are waved in the Y–Z plane.
(1) The gesture of Forward–Backward is performed as if pushing an ON/OFF switch button on the control panel of an electronic appliance.
(2) The swinging gestures of Up and Down are very natural for expressing the meaning of up and down, e.g. volume up/down and temperature up/down.
(3) Similarly, the two gestures of Left and Right naturally represent the meaning of 'previous' and 'next'.
(4) The gestures of Double-Left and Double-Right, denoting a fast move toward the left or right, suggest 'fast backward' and 'fast forward' to users.
(5) The gesture of the letter V, implying a tick or a rising movement, suggests a 'Play' operation. Additionally, we follow the convention that most current players use the same button for both 'Play' and 'Pause'.
(6) The gesture of Inverted-V implies a decreasing trend, which we define as the 'Stop' operation.
Note, however, that Up/Down and Double-Left/Double-Right are continuous commands rather than instant ones; for example, modulating the volume or adjusting curtains is a continuous operation. To avoid requiring users to perform the same gesture repeatedly, when such a command is recognized, GeeAir continuously re-issues the command at a certain interval until the user presses Button B or the controlled quantity reaches its maximum.
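The continuous-command behavior can be sketched as a simple loop; the callback names and the repeat cap are illustrative assumptions, not GeeAir internals:

```python
CONTINUOUS = {"up", "down", "double-left", "double-right"}

def issue_gesture(gesture, send, cancelled, at_maximum, max_repeats=100):
    """Issue a recognized gesture command to the current appliance.

    Instant commands are sent once.  Continuous commands are re-issued
    until the user presses Button B (`cancelled`) or the controlled
    quantity reaches its limit (`at_maximum`).
    """
    send(gesture)
    if gesture not in CONTINUOUS:
        return
    for _ in range(max_repeats):
        if cancelled() or at_maximum():
            break
        # On the device this loop would sleep for the re-issue interval.
        send(gesture)
```

With `send` issuing the wireless command, an instant gesture such as V fires once, while an Up gesture keeps raising the volume until cancelled.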
5.2 Gesture recognition with FDSVM
GeeAir employs the FDSVM algorithm [15], proposed by the authors, to recognize gesture commands from acceleration data. FDSVM uses a frame-based descriptor to compactly represent a gesture, which reduces the noise and variation of gesture data and thus significantly improves gesture recognition performance.
The FDSVM system has two main phases—training and
recognizing—and four components—acceleration data
Fig. 5 Block diagram of the FDSVM gesture recognition system: acceleration data acquisition and feature extraction (frame segmentation followed by feature calculation) are shared by both phases, feeding SVM training and recognition by SVM
Fig. 6 Illustration of segments and frames for a gesture: the gesture is divided into segments 0 to N, and each pair of adjacent segments forms one of the frames 0 to N−1
acquisition, feature extraction, training SVM, and recognition by SVM, as shown in Fig. 5. The former two components are shared by the training and recognizing phases.

5.2.1 Feature extraction: frame-based gesture descriptor

The three-axis accelerometer built into GeeAir discretely senses the gestural acceleration data along three spatially orthogonal axes. We denote a gesture command as

G = (ax, ay, az)

where ax, ay, az are the acceleration sequences from the three axes. We divide a gesture into N + 1 segments of identical length, and every two adjacent segments make up a frame with a segment-length overlap, as illustrated in Fig. 6.

We employ five features in both the frequency and spatial domains to characterize each frame.

In the frequency domain (discrete Fourier transform (DFT) on each frame per axis):

(1) mean μ: the DC component over the frame.
(2) energy e: the sum of the squared DFT component magnitudes except the DC component, subsequently divided by the number of components for the purpose of normalization.
(3) entropy δ: the normalized information entropy of the DFT component magnitudes with the DC component excluded.

In the spatial domain:

(4) standard deviation σ: indicates the amplitude variability of a gesture.
(5) correlation γ among the axes: implies the strength of the linear relationship between each pair of axes.

We combine all the features extracted as described above to form a feature vector s, which represents the gesture command itself. With 5 features per frame per axis, 3 axes, and N frames per gesture, the dimension of the feature vector is d = 5 × 3 × N = 15N.

5.2.2 Gesture classification: multiclass SVM
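The per-frame features can be sketched in plain Python with a naive DFT; the exact windowing and normalization used in FDSVM [15] may differ, so treat this as an illustration of the definitions above:

```python
import cmath
import math

def dft_magnitudes(frame):
    """Naive DFT; returns |X_k| for k = 0..n-1, where X_0 is the DC term."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))) for k in range(n)]

def frequency_features(frame):
    """Mean, energy, and entropy of one frame of one axis (frequency domain)."""
    mags = dft_magnitudes(frame)
    ac = mags[1:]                                  # DC component excluded
    mean = sum(frame) / len(frame)                 # equals X_0 / n for a real frame
    energy = sum(m * m for m in ac) / len(ac)      # normalized by component count
    total = sum(ac)
    probs = [m / total for m in ac] if total else []
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return mean, energy, entropy

def spatial_features(ax, ay):
    """Standard deviation per axis and the correlation between two axes."""
    def std(a):
        m = sum(a) / len(a)
        return math.sqrt(sum((x - m) ** 2 for x in a) / len(a))
    mx, my = sum(ax) / len(ax), sum(ay) / len(ay)
    cov = sum((x - mx) * (y - my) for x, y in zip(ax, ay)) / len(ax)
    denom = std(ax) * std(ay)
    return std(ax), std(ay), cov / denom if denom else 0.0
```

Concatenating these values for every frame and axis yields the 15N-dimensional descriptor s fed to the SVM.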
Suppose there are two types of gestures, GTR1 and GTR2, to be classified. We denote the training set with n samples as

{(s_i, g_i)}, i = 1, …, n

where s_i ∈ R^d represents the feature vector of a gesture command and

g_i = +1 if s_i belongs to GTR1; g_i = −1 if s_i belongs to GTR2.

A separating hyperplane, written as

w · s + b = 0,

can be obtained by solving a dual convex quadratic programming problem [25].
The extension to the classification of multiple gestures is achieved by a multiclass SVM using the one-versus-one or one-versus-all strategy. SVM is a method for handling highly non-linear classification and regression problems. Benefiting from the structural risk minimization principle and the avoidance of over-fitting through its soft margin, SVM usually outperforms traditional parameter estimation methods based on the Law of Large Numbers when only limited training data are available.
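The one-versus-one strategy trains one binary SVM per gesture pair and lets the pairwise winners vote; a minimal sketch with the trained binary classifier abstracted behind a callback (any ±1 classifier can stand in for the SVM here):

```python
from collections import Counter
from itertools import combinations

def predict_one_vs_one(classes, binary_predict, s):
    """Predict the gesture class of feature vector s by pairwise voting.

    `binary_predict(a, b, s)` plays the role of the trained binary SVM for
    the pair (a, b): it returns +1 if s looks like class a, and -1 if it
    looks like class b.  With G classes, G(G-1)/2 binary classifiers vote.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        winner = a if binary_predict(a, b, s) > 0 else b
        votes[winner] += 1
    return votes.most_common(1)[0][0]
```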
6 Evaluations
6.1 Implementation
We built a prototype of GeeAir, including the hardware and algorithm implementations, to verify the design and performance.

Fig. 7 Components of the Bluetooth–infrared adaptor

Currently, the GeeAir can acquire speech and gesture commands with its two buttons and perform joystick-based selection. The software, including the algorithms for speech recognition and gesture recognition, is still implemented on a PC instead of on GeeAir. We use Bluetooth to connect the GeeAir and the PC.
6.1.1 Hardware setup
Table 2 Speech vocabulary of twelve Chinese words for seven appliances

No.  Appliance         Chinese words
1    Television        Diàn shì; Diàn shì jī
2    DVD player        DVD
3    Radio             Shōu yīn; Shōu yīn jī
4    Speaker           Yīn xiǎng; Yīn xiāng
5    Air conditioner   Kōng tiáo
6    Lamp              Diàn dēng; Tái dēng; Rì guāng dēng
7    Curtain           Chuāng lián
The GeeAir prototype is built on the Nintendo Wiimote for acceleration sensing and its Nunchuk expansion for joystick selection. It has a 3-D accelerometer, a joystick, and two buttons, Button A and Button B (inspired by Button C and Button Z of the Nunchuk). The built-in microphone and speaker of GeeAir are currently substituted by a Bluetooth wireless headset connected to a laptop computer; the Wiimote also provides the communication link between the laptop and the GeeAir.
GeeAir uses Bluetooth for non-directional wireless communication. However, most current appliances use infrared remote controllers and are therefore unable to receive Bluetooth signals. We developed a Bluetooth–infrared adaptor (BI Adaptor) to convert Bluetooth signals into infrared signals; it will become unnecessary once appliances can communicate via Bluetooth. The signal light for the feedback mechanism is also embedded in the BI Adaptor, as shown in Fig. 7.
6.1.2 Algorithms implementation
For the isolated word recognition in GeeAir, the lexicon has 12 words covering seven categories of home appliances, shown in Table 2. Utterances are recorded at a 16 kHz sampling rate with 16-bit resolution. A 26-dimensional MFCC feature vector (13 cepstral coefficients and their first derivatives) is employed, computed with a window size of 32 ms and a step size of 16 ms. Each
word is represented by a trained left-to-right CDHMM with 3 states, implemented on top of HTK (the Hidden Markov Model Toolkit) [26]. An eight-mixture Gaussian distribution is used to model each state, and we run 6 Baum-Welch re-estimation iterations.
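The windowing parameters above translate directly into sample counts at 16 kHz. A minimal sketch of the analysis-window slicing (the MFCC filterbank and cepstral steps, handled by HTK in the prototype, are omitted):

```python
import numpy as np

SAMPLE_RATE = 16000                  # 16 kHz, 16-bit recordings
WIN = int(0.032 * SAMPLE_RATE)       # 32 ms window -> 512 samples
HOP = int(0.016 * SAMPLE_RATE)       # 16 ms step   -> 256 samples

def split_windows(signal):
    """Slice an utterance into the overlapping analysis windows that
    the MFCC front end operates on; consecutive windows overlap by
    half a window (WIN - HOP samples)."""
    return [signal[i:i + WIN]
            for i in range(0, len(signal) - WIN + 1, HOP)]
```

A one-second utterance thus yields 61 windows of 512 samples each, from which the 13 cepstral coefficients and their derivatives would be computed per window.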
Gesture recognition with FDSVM in GeeAir uses the open-source FFTW package [27] for the discrete Fourier transform. The five features (mean, energy, entropy, standard deviation of each axis, and inter-axis correlation) are then computed for each frame. The resulting feature vector is fed into the classifier, either to train an SVM model or to retrieve the recognized gesture type. The SVM component uses the SVMmulticlass package [28]. Details can be found in [15].
6.2 Data acquisition
To evaluate the GeeAir’s performance of oral command
recognition and gesture recognition, we built a speech
database with 7 appliance names and a gesture acceleration
database with 9 gestures. Both databases were acquired from 10 persons, including 5 males and 5 females. The collection procedure lasted 5 days.
The vocabulary of the speech database includes the 12 Chinese words for 7 appliances listed in Table 2. Some appliances have more than one name, depending on users' habits. Each user was required to record each word 4 times per day; thus, each user has 20 samples for each Chinese word.
For the gesture acceleration database, each participant was asked to perform each gesture 6 times per day. Thus, there are 6 × 5 × 9 × 10 = 2,700 samples. The start and end of a gesture were labeled by pressing Button B on the Wiimote during data acquisition. Figure 8 illustrates the acquisition devices. We divided the 9 gestures into 3 groups, as listed in Table 3, in order to evaluate usability for different potential appliances. For example, Group 1 is for the speaker, air conditioner, lamp, and curtain; Group 2 is for the television and radio.
We employed leave-one-day-out cross-validation for the user-dependent case and leave-one-person-out cross-validation for the user-independent case in both the speech and gesture experiments. For leave-one-day-out cross-validation, we divide all samples into five partitions, one per collection day (namely 60 samples per gesture per partition and 40 samples per word per partition). Each round, four of the five partitions are used for training and the remaining one for testing; this is repeated five times and the average recognition rate is reported. For leave-one-person-out cross-validation, nine participants' data (out of ten) form the training set and the remaining participant's data form the testing set.
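The leave-one-day-out scheme described above can be sketched as follows; the (day, features, label) sample layout is an assumption for illustration:

```python
def leave_one_day_out(samples):
    """Yield the five train/test splits, holding out one collection day
    at a time. Each sample is a (day, features, label) tuple; the tuple
    layout is illustrative, not taken from the paper."""
    days = sorted({day for day, _, _ in samples})
    for held_out in days:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield train, test
```

Leave-one-person-out works the same way with the participant identifier in place of the day.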
Table 3 The nine gestures divided into three groups for the gesture recognition experiments

No.  Size  Gestures
1    3     Forward–backward, up, down
2    5     Forward–backward, up, down, left, right
3    9     Forward–backward, up, down, left, right, double-left, double-right, V, inverted-V
Fig. 9 User-dependent speech recognition result varying over time (recognition rate for Day 1 to Day 5 and the average)
6.3 Speech recognition accuracy
Using the 12-word speech data described previously, the experimental results show that user-dependent speech recognition achieves an accuracy of 98.21%, while user-independent recognition achieves 91.79%. Figure 9 illustrates the recognition performance over time in the user-dependent case.
6.4 Gesture recognition accuracy
6.4.1 Experiment 1: effect of frame number N
Fig. 8 Acquisition devices of gesture acceleration data
The purpose of analyzing a gesture in frames rather than as a whole is to describe its local characteristics over time. The frame count N determines how precisely a gesture is described: intuitively, the more frames a gesture is broken into, the more detail is captured. However, too large a frame number N may cause over-fitting, and it also increases the dimension of the feature space and hence the computational complexity. This experiment examines the effect of varying N.
Figure 10 shows the experimental results for varying the frame number N, using the data set of Group 3. In both the user-dependent and user-independent curves, the recognition rate peaks in the middle of the range and drops at both ends. This supports our assumption that the features convey little discriminative information when N is too small, and that over-fitting occurs when N is too large. The recognition accuracy at N = 2 is clearly lower than the rest, while both curves are nearly flat for N between 4 and 7. In the following experiments, we choose N = 5.

Fig. 10 Experimental result for varying frame number N (recognition rate versus frame number, user-dependent and user-independent curves)
6.4.2 Experiment 2: user-dependent gesture recognition
In this experiment, to demonstrate the performance of our method, we compare it with four methods: the decision tree C4.5, Naïve Bayes, DTW, and HMM. We used Quinlan's implementation of C4.5 [29] for the comparison.

We carried out the experiments and comparison tests on the 3 groups of data sets, respectively; the results are shown in Fig. 11. For the three gestures of Group 1, all five approaches achieve recognition rates above 90%, with our proposed FDSVM achieving 99.17% (slightly below DTW's 99.76%). As the number of gesture types increases, the performance of HMM and DTW decreases significantly. In contrast, our FDSVM performs well even when recognizing all 9 gestures, with a recognition rate of 96.40%.
6.5 Response time test
We set up 8 home appliances as control objects in the laboratory: a curtain, two lights, a TV, an air conditioner, a speaker, and a DVD player. We then recruited 10 graduate students from the laboratory for the experiments, none of whom had used the GeeAir before. A series of tasks was defined as follows to test each user one after another:
1. Use speech to select a target appliance (one of the eight). After a red-light feedback from the system for confirmation, perform gestures to control the appliance.
2. Use the joystick to repeat the same task as Step 1.
3. Cover the eyes of each participant to simulate the situation of a blind person; use speech to select a target appliance (one of the eight). After a voice feedback from the system, perform gestures to control the appliance.
4. Use the joystick to repeat the same task as Step 3.
6.4.3 Experiment 3: user-independent gesture recognition

User-independent operation means that the system is fully trained before users ever use it, sparing users the effort of performing gestures as training data. The results of the user-independent gesture recognition test and comparison are shown in Fig. 12. As expected, the user-independent recognition rates are lower than the user-dependent ones. Our FDSVM shows very stable recognition performance as the number of gesture types increases, achieving 94.17% for the 3 gestures of Group 1 and 91.07% for the 9 gestures of Group 3. DTW achieves 97.38% for Group 1 and 95.78% for Group 2, slightly outperforming our method; however, FDSVM significantly outperforms DTW on the 9 gestures of Group 3. This result shows that FDSVM generalizes well with respect to the number of gesture types.
Fig. 11 Experimental results for the user-dependent case (recognition rate per group, Groups 1 to 3)

Fig. 12 Experimental result for the user-independent case (recognition rate per group, Groups 1 to 3)
Table 4 Average response time of different stages (unit: milliseconds)

Target selection: Joystick 1266; Speech (speaking + recognition) 1397 + 406
Feedback: Light 43; Voice 736
Gesture (action + recognition): 426 + 57
Table 4 shows the average response time of the different stages when the students used the GeeAir prototype. Selecting a target with the joystick is faster than with speech, because speech selection takes considerable time (about 1.4 s) just to speak an appliance name. The computational cost of recognition, for both speech and gesture, is under 0.5 s. For the user, the light feedback is nearly instantaneous (only 43 ms). The whole gesture command procedure, covering the gesture action and its recognition, takes 0.483 s on average.
7 Conclusions
We have developed a handheld, universal multimodal remote control device, called GeeAir, for controlling home appliances via a mixed modality of speech, gesture, joystick, button, and light. Compared with existing universal remote controllers, GeeAir enables even those with physical, hearing, or vision impairments to control home appliances in a natural manner. Compared with existing multimodal solutions for interacting with smart environments, GeeAir provides a handy, single-device solution that not only offers comfort and convenience to ordinary users but also meets the special needs of physically and vision-impaired people in operating home appliances.
Each single modality, whether speech, gesture, joystick, button, or light, has its own strengths and weaknesses. By combining these diverse but complementary modalities in a single device, different home user groups can always find a combination of modalities they feel comfortable using to interact with the environment. GeeAir represents an interesting attempt toward bringing multimodal interaction techniques closer to the everyday life of home users, particularly those who need assistance for independent living.
Speech and gesture are the two most natural ways people interact with each other. Even though continuous speech and gesture recognition techniques are still not mature enough for real deployments, we achieved very good performance in our work by standardizing a small set of easily learned verbal commands and gestures and by introducing feedback mechanisms.
Multimodal interaction devices are necessary for mobile and ubiquitous environments. The GeeAir prototype permits us to begin developing the design space for mapping interactions to multimodal commands. Such a space will be necessary for optimally supporting different home users in different contexts.
The initial test results show clear benefits of the multimodal GeeAir over universal remote controllers and other single-modality solutions. In the future, we plan to conduct a series of formal evaluations of GeeAir with real home users, including elderly and disabled inhabitants. We hope the study will shed light on the cognitive load of various combinations of modalities (speech-gesture, joystick-gesture, speech-button, and joystick-button) in order to further improve the future design of GeeAir.
Acknowledgments The authors thank the anonymous reviewers for their comments and suggestions. The laboratory students' participation in the experiments is greatly appreciated. This
work is supported in part by the National High-Tech Research and
Development (863) Program of China (No. 2008AA01Z132,
2009AA011900), the Natural Science Fund of China (No. 60525202,
60533040), and the France ICT-Asia I-CROSS program. Dr. Shijian
Li is corresponding author.
References
1. Campbell LW (1997) A more 'universal' remote control. http://web.media.mit.edu/~lieber/Teaching/Collab97/Collab-Projects/remote.html
2. http://www.consumer.philips.com/consumer/en/gb/consumer/cc/_categoryid_3000_SERIES_REMOTE_CONTROL_SU_GB_CONSUMER/ [4-in-1 TV/VCR/DVD/SAT]
3. http://www.oneforall.co.uk/en_UK/product/1/universal-remotes/3/advanced/25/digital-12
4. http://www.logitech.com/index.cfm/remotes/universal_remotes/devices/3898&cl=us,en
5. http://www.universalremote.com/product_detail.php?model=158
6. Lee L, Johnson T (2006) URCousin: universal remote control
user interface. In: Proceedings of the Human Interface Technologies Conference, April 2006
7. Niezen G, Hancke GP (2008) Gesture recognition as ubiquitous
input for mobile phones. International Workshop on Devices that
Alter Perception (DAP08), conjunction with Ubicomp08, 2008
8. Bolt RA (1980) Put-that-there: voice and gesture at the graphics
interface, SIGGRAPH’80, pp 262–270
9. Machate J, Burmester M, Bekiaris E (1997) Towards an intelligent multimodal and multimedia user interface providing a
new dimension of natural HMI in the teleoperation of all
home appliances by E&D users, 6th International Conference
Man–Machine Interactions Intelligent Systems in Business, Montpellier, May 1997, pp 226–229
10. Machate J (1999) Being natural—on the use of multimodal interaction concepts in smart homes. In: Proceedings of HCI International '99, pp 937–941
11. Wilson A, Oliver N (2003) GWindows: robust stereo vision for gesture-based control of windows. In: Proceedings of the 5th International Conference on Multimodal Interfaces, New York, NY, USA, pp 211–218
12. Krum DM, Omoteso O, Ribarsky W, Starner T, Hodges LF (2002) Speech and gesture multimodal control of a whole Earth 3D visualization environment. In: Proceedings of the Symposium on Data Visualization, Barcelona, Spain, pp 195–200
13. Starner T, Auxier J, Ashbrook D, Gandy M (2000) The gesture pendant: a self-illuminating, wearable, infrared computer vision system for home automation control and medical monitoring. International Symposium on Wearable Computers (ISWC'00), pp 87–95
14. Kela J, Korpipää P, Mäntyjärvi J, Kallio S, Savino G, Jozzo L, Marca D (2006) Accelerometer-based gesture control for a design environment. Pers Ubiquitous Comput 10:285–299
15. Wu J, Pan G, Li S, Zhang D (2009) Gesture recognition with a 3D accelerometer. In: The Sixth International Conference on Ubiquitous Intelligence and Computing (UIC-09), Brisbane, Australia, 7–9 July 2009
16. Rabiner L, Levinson S (1981) Isolated and connected word recognition—theory and selected applications. IEEE Trans Commun 29(5):621–659
17. Rabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77:257–286
18. Lee C-H, Lin C-H, Juang B-H (1991) A study on speaker adaptation of the parameters of continuous density hidden Markov models. IEEE Trans Signal Process 39(4):806–814
19. Davis SB, Mermelstein P (1980) Comparison of parametric
representation for monosyllabic word recognition in continuously
spoken sentences. IEEE Trans Acoust Speech Signal Process
28:357–366
20. Mitra S, Acharya T (2007) Gesture recognition: a survey. IEEE Trans Syst Man Cybern Part C 37(3):311–324
21. Schlömer T, Poppinga B, Henze N, Boll S (2008) Gesture Recognition with a Wii Controller. International Conference on
Tangible and Embedded Interaction (TEI’08), pp 11–14, Bonn
Germany, Feb. 18–20, 2008
22. Mäntylä V-M (2001) Discrete hidden markov models with
application to isolated user-dependent hand gesture recognition.
VTT publications
23. Liu J, Wang Z, Zhong L, Wickramasuriya J, Vasudevan V (2009)
uWave: accelerometer-based personalized gesture recognition
and its applications. IEEE PerCom’09, 2009
24. Mäntyjärvi J, Kela J, Korpipää P, Kallio S (2004) Enabling fast
and effortless customization in accelerometer based gesture
interaction. Proceedings of the 3rd International Conference on
Mobile and Ubiquitous Multimedia (MUM’04), ACM Press, 25–
31, October 27–29
25. Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge
26. HTK: http://htk.eng.cam.ac.uk/
27. Frigo M, Johnson SG (2005) The design and implementation of
FFTW3. Proc IEEE 93(2)
28. Joachims T (1999) Making large-scale SVM learning practical. In: Schölkopf B, Burges C, Smola A (eds) Advances in kernel methods—support vector learning. MIT Press
29. Quinlan JR (1996) Improved use of continuous attributes in c4.5.
J Artif Intell Res 4:77–90