Knowledge-Based Systems 13 (2000) 497-504
www.elsevier.com/locate/knosys
Emotion recognition and its application to computer agents with spontaneous interactive capabilities
R. Nakatsu*, J. Nicholson, N. Tosa
ATR Media Integration and Communications Research Laboratories, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan
*Corresponding author. Tel.: +81-774-95-1400. E-mail addresses: nakatsu@mic.atr.co.jp (R. Nakatsu), jjnichol@mic.atr.co.jp (J. Nicholson), tosa@mic.atr.co.jp (N. Tosa).
Portions from the Proceedings of the Third Conference on Creativity and Cognition, Loughborough, UK, October 10-13, 1999, pp. 135-143. Reproduced with permission from ACM © 1999.
Abstract

In this paper, we first study the recognition of emotions in human speech. We propose an emotion recognition algorithm based on a neural network, together with a method for collecting a large speech database containing emotional utterances. We carried out emotion recognition experiments with a neural network trained on this database and obtained a recognition rate of approximately 50% in a speaker-independent mode for eight emotional states.

We then applied this emotion recognition algorithm to a computer agent that plays a character role in the interactive movie system we are developing. We propose emotion recognition as a key technology in an architecture for computer characters with both narrative-based and spontaneous interaction capabilities. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Emotion recognition; Neural net; Computer agent
1. Introduction
Nonverbal communication plays a very important role in human communication. Telephones have mainly been used for business communication, but recently they are used more and more for everyday communication among family members and friends; the spread of cellular phones, especially among the younger generation, has accelerated this trend. It is clear that the exchange of nonverbal information, such as emotions, is important in all forms of communication. In this sense, nonverbal communication is the basis of human communication.

In addition to human-to-human communication, communication between humans and computer agents has become more and more common. Computer agents that act as electronic secretaries or communication mediators will become common entities in our society. For such agents, the capability of communicating with humans through both verbal and nonverbal channels will be essential, and it will make interactions between computers and humans more intimate and human-like.

Although the importance of the nonverbal aspects of communication has been recognized, until now most research has dealt with nonverbal information in images; facial expression recognition and gesture recognition are good examples. The recognition of emotions in human speech, on the other hand, has rarely been treated. For these reasons, we have studied the recognition of emotions in speech and believe it is an essential research area.

One of the reasons there has been little research on human emotion recognition is that it is difficult to collect a large number of utterances containing emotions. The strategy we have adopted is to ask a radio actor to utter a number of words with various kinds of emotional expression, and then to ask many speakers to produce emotional utterances by listening to and mimicking the actor's recordings. We adopted eight emotions, including a neutral state, and succeeded in collecting a large speech database uttered by 50 male speakers with this method. Using part of this database as training data, we obtained a neural network that can recognize emotional utterances. We then carried out recognition experiments to evaluate the performance of the network, comparing two kinds of experiment (an open test and a closed test), and concluded that a recognition rate of approximately 50% is obtainable for speaker-independent emotion recognition.
Fig. 1. Processing flow diagram.
For the next step, we applied this emotion recognition technology to computer agents that can communicate with humans using both verbal and nonverbal channels. We have conducted research on interactive movie production by applying interaction technologies to conventional movie-making techniques. The integration of interaction and narratives is expected to produce a new type of experience, which we call "Interactive Movies": we can interact in the story as well as watch it. This gives us a great opportunity to learn various kinds of skills and lessons through dramatic experiences. We have already produced a prototype system [1]. Unfortunately, this system lacks the capability of accepting spontaneous interaction; in evaluations, we learned that spontaneous interaction is the key element for subject participation in narratives. Based on this evaluation, we developed a second prototype system in which emotion recognition works as a key function for realizing spontaneous interaction capabilities.
This paper first describes the emotion recognition algorithm and the emotion recognition experiments we have carried out. It then introduces the configuration of the interactive movie system and the structure of the agents in the system, where emotion recognition plays a key role in introducing spontaneous interaction capabilities.
2. Recognition of emotion involved in speech

2.1. Basic principle

We have considered and emphasized the following issues in recognizing emotions.

1. Treatment of various emotional expressions: how many and what kinds of emotional expressions should be treated are interesting yet difficult issues. The following are some examples of the emotional expressions treated in several papers:
a. neutrality, joy, boredom, sadness, anger, fear and indignation [2];
b. anger, fear, sadness, joy and disgust [3];
c. neutrality, happiness, sadness, anger, fear, boredom and disgust [4];
d. fear, anger, sadness and happiness [5].
Considering these examples, we have selected eight emotional states in this study: anger, sadness, happiness, fear, surprise, disgust, playfulness and neutrality.
2. Speech features: there are two kinds of speech feature, phonetic and prosodic. In emotion recognition, prosodic features play an important role. At the same time, phonetic features are just as important, because prosodic and phonetic features are tightly coupled when speech is uttered, and it is impossible to express emotions by controlling prosodic features alone. Therefore, a combination of the two kinds of feature is considered in this study: features expressing the phonetic characteristics of speech and features expressing its prosodic characteristics.
3. Speaker-independent and content-independent emotion recognition: speaker independence is an important aspect of speech/emotion recognition. From a pragmatic standpoint, a speaker-dependent emotion recognition system requires a tiresome learning stage each time a new speaker wants to use the system, so it is not easy to use. Moreover, humans can understand the emotions included in speech, as well as the meaning it conveys, even for arbitrary speakers. Content independence is likewise indispensable: in daily communication, various kinds of emotion are conveyed with the same words or sentences, and this is the key to rich and sensitive interpersonal communication. Thus, we adopt a neural network architecture and introduce a training stage that uses a large number of training utterances for a speaker-independent and content-independent emotion recognition system.

Fig. 1 illustrates a block diagram of the processing flow. The process mainly consists of two parts: speech processing and emotion recognition.
The details of each process and the system configuration for carrying out the emotion recognition process are described in the following sections.
2.2. Feature extraction
(1) Speech feature calculation. Two kinds of feature are used in emotion recognition: a phonetic feature and a prosodic feature. Linear predictive coding (LPC) parameters [6], typical speech feature parameters often used for speech recognition, are adopted as the phonetic feature. The prosodic feature, on the other hand, consists of three factors: amplitude structure, temporal structure and pitch structure. Speech power and pitch parameters, both obtainable from the LPC analysis, are used as the features expressing the amplitude structure and the pitch structure. In addition, a delta LPC parameter is adopted; it is calculated from the LPC parameters and expresses the time variation of the speech spectrum, so it corresponds to the temporal structure.

The speech feature calculation is carried out in the following way. Analog speech is first passed through a 6 kHz low-pass filter and fed into an A/D converter with an 11 kHz sampling rate and 16-bit accuracy. The digitized speech is then arranged into a series of frames, each a set of 256 consecutive samples. LPC analysis is carried out in real time, and the following feature parameters are obtained for each frame:
speech power: P;
pitch: p;
LPC parameters: c_1, c_2, ..., c_{12};
delta LPC parameter: d.

Thus, for the t-th frame, the obtained feature parameters can be expressed as

F_t = (P_t, p_t, d_t, c_{1,t}, c_{2,t}, ..., c_{12,t}).

The sequence of these feature vectors is fed into the speech period extraction stage.
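To make the per-frame analysis concrete, the following is a minimal sketch of how the 15 parameters of F_t could be computed with NumPy. The frame length, sampling rate and LPC order follow the text above; the autocorrelation pitch estimator, the log-energy form of the power term and the definition of the delta LPC value as the distance between consecutive LPC vectors are assumptions made here for illustration, since the excerpt does not specify them.

```python
import numpy as np

SR = 11000          # sampling rate (Hz), as in the paper
FRAME_LEN = 256     # samples per frame, as in the paper
LPC_ORDER = 12      # c_1 ... c_12

def lpc_coefficients(frame, order=LPC_ORDER):
    """LPC coefficients by the autocorrelation method (Levinson-Durbin)."""
    r = np.array([frame[: frame.size - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0] + 1e-10
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[1:i][::-1]
        k = -acc / err
        prev = a.copy()
        a[i] = k
        a[1:i] = prev[1:i] + k * prev[1:i][::-1]
        err *= 1.0 - k * k
    return a[1:]                                   # c_1 ... c_12

def pitch_hz(frame, sr=SR, fmin=60.0, fmax=400.0):
    """Crude autocorrelation pitch estimate (assumed; the paper does not specify its method)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[frame.size - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    return sr / (lo + int(np.argmax(ac[lo:hi])))

def frame_features(frame, prev_lpc=None):
    """Return the 15-dimensional F_t = (P_t, p_t, d_t, c_1t, ..., c_12t).
    `frame` is a 1-D float array of 256 samples."""
    power = float(np.log(frame @ frame + 1e-10))   # speech power P_t (log energy)
    pitch = pitch_hz(frame)                        # pitch p_t
    lpc = lpc_coefficients(frame)                  # c_1t ... c_12t
    # delta LPC d_t: spectral change w.r.t. the previous frame (assumed definition)
    delta = float(np.linalg.norm(lpc - prev_lpc)) if prev_lpc is not None else 0.0
    return np.concatenate(([power, pitch, delta], lpc)), lpc
```

Scanning an utterance frame by frame with frame_features yields the sequence F_1, F_2, ... used in the next stage.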
(2) Speech period extraction and speech feature extraction. First, the period in which speech is present is extracted based on the speech power. The speech power is compared with a predetermined threshold value (PTH); if the input power exceeds this threshold for a few consecutive frames, speech is judged to have started. After the beginning of the speech period, the input power is again compared with PTH; if the power stays below PTH for another few consecutive frames, speech is judged to have ended. The speech period is extracted from the whole input through this process.

Twenty frames are then sampled from the extracted speech period, spaced at equal intervals over the whole period. Let these 20 frames be f_1, f_2, ..., f_{20}. The feature parameters of these 20 frames are collected, and the output speech features are determined as a 300-dimensional (15 × 20) feature vector. This feature vector is expressed as

FV = (F_1, F_2, ..., F_{20}),

where F_i is the vector of the 15 feature parameters corresponding to frame f_i. This feature vector (FV) is then used as the input to the emotion recognition stage.
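As an illustration of the period extraction and frame sampling just described, here is a small sketch. The threshold PTH and the length of the "few consecutive frames" run are placeholders, since their actual values are not given in the excerpt; the function consumes the per-frame vectors produced by the sketch in the previous subsection.

```python
import numpy as np

N_FRAMES, FEATS_PER_FRAME = 20, 15
PTH = -4.0        # power threshold (placeholder value)
MIN_RUN = 3       # "a few consecutive frames" (placeholder value)

def speech_period(powers, pth=PTH, min_run=MIN_RUN):
    """Return (start, end) frame indices of the detected speech period."""
    start, end, run = None, None, 0
    for i, p in enumerate(powers):
        if start is None:                          # waiting for speech to start
            run = run + 1 if p > pth else 0
            if run == min_run:
                start, run = i - min_run + 1, 0
        else:                                      # waiting for speech to end
            run = run + 1 if p <= pth else 0
            if run == min_run:
                end = i - min_run + 1
                break
    return start, (end if end is not None else len(powers))

def utterance_feature_vector(frame_feats):
    """Sample 20 equally spaced frames from the speech period and concatenate
    their 15 parameters into the 300-dimensional vector FV."""
    powers = np.array([f[0] for f in frame_feats])   # P_t is the first entry of F_t
    start, end = speech_period(powers)
    if start is None:
        raise ValueError("no speech period detected")
    idx = np.linspace(start, end - 1, N_FRAMES).round().astype(int)
    fv = np.concatenate([frame_feats[i] for i in idx])
    assert fv.shape == (N_FRAMES * FEATS_PER_FRAME,)  # 300 dimensions
    return fv
```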
Fig. 2. Emotion recognition part configuration.

2.3. Emotion recognition

Recognizing emotions is a difficult task. The main reason is that people rely mainly on meaning in daily communication, especially in business communication; this is why speech recognition research has long treated the emotions contained in speech simply as fluctuations or noise. What makes the situation more complicated is that emotional expressions are consciously or unconsciously intertwined with the meaning of speech. In the unconscious case, context plays a more important role than emotional features, and as a result the intensity of emotional expression varies dramatically depending on the situation. Our final target is, of course, to recognize emotions in speech even when emotional expression is unconsciously mixed with the meaning of speech. For the reasons above, however, this is not yet our research target. Instead, the strategy adopted here is to treat speech intentionally uttered with specific emotional expressions rather than speech with unconscious emotional expressions.

Several recognition algorithms are available, such as neural networks and hidden Markov models (HMMs) [7]. HMMs are suitable when the structure of the recognition target is clear to some extent. Since the structure of an emotional feature is not clear, a neural network approach seems more suitable, and we have adopted it here.
(1) Configuration of the neural network. The configuration of the neural network for emotion recognition is shown in Fig. 2. The network is a combination of eight sub-networks, each tuned to recognize one of the eight emotions (anger, sadness, happiness, fear, surprise, disgust, playfulness or neutrality). The construction of each sub-network is shown in Fig. 3; all of the sub-networks have basically the same architecture. Each is a four-layered neural network with 300 input nodes, corresponding to the dimension of the speech features, and one output node; the number of intermediate nodes varies depending on the specific emotion. We adopted this architecture in consideration of the difficulty of recognizing specific emotions: it is easier to prepare a dedicated neural network for each emotion and to tune each network to the characteristics of the emotion it must recognize. Although negative emotions such as anger or sadness are rather easy to recognize, positive emotions such as happiness can be difficult. Thus, the detailed architecture of the networks, such as the number of intermediate nodes, differs from emotion to emotion.
Fig. 3. Sub-network configuration.

(2) Emotion recognition by a neural network. In the emotion recognition phase, the speech feature parameters extracted in the speech processing part are fed simultaneously into the eight sub-networks trained as described above. Eight values, V = (v_1, v_2, ..., v_8), are obtained as the result of the emotion recognition.
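A minimal sketch of this eight-sub-network arrangement, written with PyTorch, is shown below. The four-layer shape (300 inputs, two hidden layers, one output) follows the text; the sigmoid activations, the particular hidden-layer sizes and the rule of reporting the emotion with the largest output are assumptions made for illustration.

```python
import torch
import torch.nn as nn

EMOTIONS = ["anger", "sadness", "happiness", "fear",
            "surprise", "disgust", "playfulness", "neutrality"]

def make_subnet(n_hidden1=64, n_hidden2=16):
    # Four-layer sub-network: 300 inputs, two hidden layers, one output in [0, 1].
    # Hidden sizes differ per emotion in the paper; the defaults here are placeholders.
    return nn.Sequential(
        nn.Linear(300, n_hidden1), nn.Sigmoid(),
        nn.Linear(n_hidden1, n_hidden2), nn.Sigmoid(),
        nn.Linear(n_hidden2, 1), nn.Sigmoid(),
    )

subnets = {e: make_subnet() for e in EMOTIONS}   # one sub-network per emotion

def recognize(fv):
    """Feed one 300-dimensional feature vector FV to all eight sub-networks
    and return V = (v_1, ..., v_8) plus the most strongly activated emotion."""
    x = torch.as_tensor(fv, dtype=torch.float32).reshape(1, -1)
    with torch.no_grad():
        v = {e: float(net(x)) for e, net in subnets.items()}
    return v, max(v, key=v.get)
```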
2.4. Emotion recognition experiment

(1) Speech database collection. It is necessary to train each of the sub-networks for the recognition of emotions. The most important and most difficult issue for neural network training is how to collect a large amount of speech data containing emotions. As our target is content-independent emotion recognition, we adopted 100 phoneme-balanced words as the training word set; examples are "school", "hospital" and "standard". Since we utter most of these words without any special emotion in daily life, it is difficult for ordinary people to utter them with intentional emotions. Therefore, we adopted the following strategy.
(a) First, we ask a radio actor to utter the 100 words with each of the eight emotions. As a professional, he is used to speaking various kinds of words, phrases and sentences with intentional emotions.
(b) Then, we ask speakers to listen to each of these utterances and mimic its tone, and we record the utterances spoken by these ordinary people.

The problem with this strategy is that the spoken emotions are not natural but "forced" emotions. However, based on the consideration described in Section 2.3, forced or intentional emotions are exactly what we study in this research.
Since our target is speaker-independent and content-independent emotion recognition, the following utterances were prepared for the training process:
words: 100 phoneme-balanced words;
speakers: 50 male speakers and 50 female speakers;
emotions: neutrality, anger, sadness, happiness, fear, surprise, disgust and playfulness;
utterances: each speaker uttered the 100 words eight times, using a different emotional expression in each of the eight trials, so that a total of 800 utterances per speaker was obtained as training data.
(2) Training and recognition experiment. We used 30 of the 50 speakers for training. To learn the effect of the number of speakers used for training, we carried out five types of neural net training and obtained the following neural networks:
neural network 1: 10 networks, each trained on a single speaker (#1, #2, ..., #10);
neural network 2: five networks, each trained on two speakers (#1 and #2, #3 and #4, ..., #9 and #10);
neural network 3: two networks, each trained on five speakers (#1-#5 and #6-#10);
neural network 4: one network trained on 10 speakers (#1-#10);
neural network 5: one network trained on 30 speakers (#1-#30).
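The following sketch shows how one sub-network could be trained on such a database, treating it as a one-against-rest detector for its emotion (target 1 for utterances of that emotion, 0 otherwise). The one-against-rest targets, loss, optimiser and epoch count are assumptions; the text states only that each sub-network is trained for its own emotion.

```python
import torch
import torch.nn as nn

def train_subnet(net, loader, epochs=50, lr=1e-3):
    """Train one emotion sub-network; `loader` yields (fv, target) batches with
    fv of shape (batch, 300) and target of shape (batch, 1) in {0.0, 1.0}."""
    optimiser = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        for fv, target in loader:
            optimiser.zero_grad()
            loss = loss_fn(net(fv), target)
            loss.backward()
            optimiser.step()
    return net

# e.g. anger_net = train_subnet(make_subnet(), anger_loader)   # anger_loader is hypothetical
```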
In addition, we carried out two types of recognition experiment to evaluate the performance of the obtained neural networks.
Open recognition experiment: utterances spoken by speakers not included in the training sets are used; twenty speakers (#31-#50) were used for this experiment.
Closed recognition experiment: utterances spoken by the speakers included in the training sets are used.
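The difference between the two experiments lies only in which speakers supply the test utterances, as the following sketch indicates; the data layout (a dict mapping speaker id to labelled feature vectors) is hypothetical.

```python
TRAIN_SPEAKERS = range(1, 31)       # #1-#30, used for training
OPEN_TEST_SPEAKERS = range(31, 51)  # #31-#50, never seen during training

def recognition_rate(data, speakers):
    """data: {speaker_id: [(feature_vector, true_emotion), ...]}"""
    correct = total = 0
    for s in speakers:
        for fv, true_emotion in data[s]:
            _, predicted = recognize(fv)          # from the sketch in Section 2.3
            correct += int(predicted == true_emotion)
            total += 1
    return correct / total

# closed test: recognition_rate(data, TRAIN_SPEAKERS)
# open test:   recognition_rate(data, OPEN_TEST_SPEAKERS)
```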
The recognition results obtained for both closed and open recognition are shown in Fig. 4 for male speakers and Fig. 5 for female speakers. These results show the following facts.
Fig. 4. Emotion recognition results for male data.

(a) For closed recognition experiments, the recognition rate decreases as the number of training speakers increases, approaching 50-55%.
(b) For open recognition experiments, the recognition rate increases as the number of training speakers increases, also approaching 50-55%.
These two trends indicate that, with enough training speakers, an emotion recognition rate of 50-55% can be obtained in the speaker-independent mode. Furthermore, even with 30 training speakers we obtain approximately 50%, which is satisfactory compared with the recognition performance expected for an adequate number of training speakers.

From these recognition experiments we conclude that our emotion recognition capability is sufficient for computer agents to communicate with people through this nonverbal channel.
Fig. 5. Emotion recognition results for female data.

3. Creation of computer agents with spontaneous interactive capabilities

3.1. Overview

As one application of emotion recognition technology, we have tried to apply it to the computer agents in the interactive movie system we are studying. Our main reason for studying interactive movies is as follows.
Ever since the Lumiere brothers created cinematography at the end of the 19th century, movies have undergone various advances in technology and content. Today, movies have established themselves as a composite art form covering a wide range from fine arts to mass entertainment. Movies give us the feeling of experiencing various kinds of dramatic events and happenings in their narratives. However, these experiences are passive illusions, and as a result what we can experience, feel and learn from them is limited.
The integration of interaction and narratives is expected to produce a new type of experience, which we call "Interactive Movies": we can not only watch the story but also interact in it. This provides a totally new kind of experience, in which we encounter dramatic events or narratives that cannot be met in daily life as a subject of the event rather than from the perspective of a third person. This gives us a great opportunity to learn various kinds of skills and lessons through dramatic experiences.
One of the key factors of an interactive movie system is the computer character that interacts with the participants who play the main characters. In the first system we developed, the behaviors of the characters are controlled based on the narratives [1]. In evaluating that system, we learned that the spontaneous reactions of the characters are as important as the narrative-based interactions. Therefore, we have tried to integrate spontaneous interaction capabilities into the movie characters by using emotion recognition.
3.2. Spontaneous interaction

In evaluating the first prototype system, we recognized that there are generally two types of interaction for the computer characters: narrative-based interactions and spontaneous interactions. We also recognized that only the narrative-based interaction capabilities were realized in the first system.
Basically, spontaneous interactions occur between the participants and the characters and do not affect

The emotional state of character j is determined as a function of the emotional states of the participants and the character:

E_o(j, T+1) = f_l(E_p(1, T), E_p(2, T), E_o(j, T)).
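The excerpt does not define f_l, so the following is only one plausible instantiation of the update above: the character's emotional state, represented as a distribution over the eight emotion categories, is blended with the emotional states recognized for the two participants. The blending weight alpha is an illustrative assumption.

```python
import numpy as np

def update_character_emotion(e_char, e_p1, e_p2, alpha=0.6):
    """One possible f_l for E_o(j, T+1) = f_l(E_p(1,T), E_p(2,T), E_o(j,T)):
    keep a fraction alpha of the character's own state and absorb the rest
    from the average of the two participants' recognized emotional states."""
    blended = 0.5 * (np.asarray(e_p1, float) + np.asarray(e_p2, float))
    new_state = alpha * np.asarray(e_char, float) + (1.0 - alpha) * blended
    return new_state / new_state.sum()     # keep it a probability distribution
```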