Knowledge-Based Systems 13 (2000) 497-504
www.elsevier.com/locate/knosys
Emotion recognition and its application to computer agents with spontaneous interactive capabilities
R. Nakatsu*, J. Nicholson, N. Tosa
ATR Media Integration and Communications Research Laboratories, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan
*Corresponding author. Tel.: +81-774-95-1400. E-mail addresses: nakatsu@mic.atr.co.jp (R. Nakatsu), jjnichol@mic.atr.co.jp (J. Nicholson), tosa@mic.atr.co.jp (N. Tosa).
Portions from the Proceedings of the Third Conference on Creativity and Cognition, Loughborough, UK, October 10-13, 1999, pp. 135-143. Reproduced with permission from ACM © 1999.
Abstract

In this paper, we first study the recognition of emotions in human speech. We propose an emotion recognition algorithm based on a neural network, together with a method for collecting a large speech database containing emotional utterances. We carried out emotion recognition experiments with a neural network trained on this database and obtained a recognition rate of approximately 50% in a speaker-independent mode for eight emotional states.

We then applied this emotion recognition algorithm to a computer agent that plays a character role in the interactive movie system we are developing. We propose emotion recognition as a key technology in an architecture for computer characters with both narrative-based and spontaneous interaction capabilities. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Emotion recognition; Neural net; Computer agent
1. Introduction
Nonverbal communication plays a very important role in human communication. Telephones have mainly been used for business communication, but recently they are used more and more for everyday communication among family members and friends; the spread of cellular phones, especially among the younger generation, has accelerated this trend. It is clear that the exchange of nonverbal information, such as emotions, is important in all forms of communication. In this sense, nonverbal communication is the basis of human communication.

In addition to human-to-human communication, communication between humans and computer agents has become more and more common. Computer agents that act as electronic secretaries or communication mediators will become common entities in our society. For such agents, the capability of communicating with humans through both verbal and nonverbal channels will be essential, and it will make interactions between computers and humans more intimate and human-like.

Although the importance of the nonverbal aspects of communication has been recognized, until now most research has dealt with nonverbal information in images; facial expression recognition and gesture recognition are good examples. The recognition of emotions in human speech, on the other hand, has rarely been treated. For these reasons, we have studied the recognition of emotions in speech and believe it is an essential research area.

One of the reasons there has been little research on human emotion recognition is that it is difficult to collect a large number of utterances containing emotions. The strategy we have adopted is to ask a radio actor to utter a number of words with various kinds of emotional expression, and then to ask many speakers to produce emotional utterances by listening to and mimicking the actor's recordings. We adopted eight emotions, including a neutral state, and succeeded in collecting a large speech database uttered by 50 male speakers with this method. Using part of this database as training data, we obtained a neural network that can recognize emotional utterances. We then carried out recognition experiments to evaluate the performance of the network, comparing two kinds of experiment (an open test and a closed test), and concluded that a recognition rate of approximately 50% is obtainable for speaker-independent emotion recognition.
Fig. 1. Processing flow diagram.
For the next step, we applied this emotion recognition technology to computer agents that can communicate with humans using both verbal and nonverbal channels. We have conducted research on interactive movie production by applying interaction technologies to conventional movie-making techniques. The integration of interaction and narratives is expected to produce a new type of experience, which we call "Interactive Movies": we can interact in the story as well as watch it. This gives us a great opportunity to learn various kinds of skills and lessons through dramatic experiences. We have already produced a prototype system [1]. Unfortunately, this system lacks the capability of accepting spontaneous interaction; in evaluations, we learned that spontaneous interaction is the key element for subject participation in narratives. Based on this evaluation, we developed a second prototype system in which emotion recognition works as a key function for realizing spontaneous interaction capabilities.
This paper first describes the emotion recognition algorithm and the emotion recognition experiments we have carried out. It then introduces the configuration of the interactive movie system and the structure of the agents in the system, where emotion recognition plays a key role in introducing spontaneous interaction capabilities.
2. Recognition of emotion involved in speech

2.1. Basic principle

We have considered and emphasized the following issues in recognizing emotions.

1. Treatment of various emotional expressions: how many and what kinds of emotional expressions should be treated are interesting yet difficult issues. The following are some examples of the emotional expressions treated in several papers:
a. neutrality, joy, boredom, sadness, anger, fear and indignation [2];
b. anger, fear, sadness, joy and disgust [3];
c. neutrality, happiness, sadness, anger, fear, boredom and disgust [4];
d. fear, anger, sadness and happiness [5].
Considering these examples, we have selected eight emotional states in this study: anger, sadness, happiness, fear, surprise, disgust, playfulness and neutrality.
2. Speech features: there are two kinds of speech feature, phonetic and prosodic. In emotion recognition, prosodic features play an important role. At the same time, phonetic features are just as important, because prosodic and phonetic features are tightly coupled when speech is uttered, and it is impossible to express emotions by controlling prosodic features alone. Therefore, a combination of the two kinds of feature is considered in this study: features expressing the phonetic characteristics of speech and features expressing its prosodic characteristics.
3. Speaker-independent and content-independent emotion recognition: speaker independence is an important aspect of speech/emotion recognition. From a pragmatic standpoint, a speaker-dependent emotion recognition system requires a tiresome learning stage each time a new speaker wants to use the system, so it is not easy to use. Moreover, humans can understand the emotions included in speech, as well as the meaning it conveys, even for arbitrary speakers. Content independence is likewise indispensable: in daily communication, various kinds of emotion are conveyed with the same words or sentences, and this is the key to rich and sensitive interpersonal communication. Thus, we adopt a neural network architecture and introduce a training stage that uses a large number of training utterances for a speaker-independent and content-independent emotion recognition system.

Fig. 1 illustrates a block diagram of the processing flow. The process mainly consists of two parts: speech processing and emotion recognition.
The details of each process and the system configuration for carrying out the emotion recognition process are described in the following sections.
2.2. Feature extraction
(1) Speech feature calculation. Two kinds of feature are used in emotion recognition: a phonetic feature and a prosodic feature. Linear predictive coding (LPC) parameters [6], typical speech feature parameters often used for speech recognition, are adopted as the phonetic feature. The prosodic feature, on the other hand, consists of three factors: amplitude structure, temporal structure and pitch structure. Speech power and pitch parameters, both obtainable from the LPC analysis, are used as the features expressing the amplitude structure and the pitch structure. In addition, a delta LPC parameter is adopted; it is calculated from the LPC parameters and expresses the time variation of the speech spectrum, so it corresponds to the temporal structure.

The speech feature calculation is carried out in the following way. Analog speech is first passed through a 6 kHz low-pass filter and fed into an A/D converter with an 11 kHz sampling rate and 16-bit accuracy. The digitized speech is then arranged into a series of frames, each a set of 256 consecutive samples. LPC analysis is carried out in real time, and the following feature parameters are obtained for each frame:
speech power: P;
pitch: p;
LPC parameters: c_1, c_2, ..., c_{12};
delta LPC parameter: d.

Thus, for the t-th frame, the obtained feature parameters can be expressed as

F_t = (P_t, p_t, d_t, c_{1,t}, c_{2,t}, ..., c_{12,t}).

The sequence of these feature vectors is fed into the speech period extraction stage.
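To make the per-frame analysis concrete, the following is a minimal sketch of how the 15 parameters of F_t could be computed with NumPy. The frame length, sampling rate and LPC order follow the text above; the autocorrelation pitch estimator, the log-energy form of the power term and the definition of the delta LPC value as the distance between consecutive LPC vectors are assumptions made here for illustration, since the excerpt does not specify them.

```python
import numpy as np

SR = 11000          # sampling rate (Hz), as in the paper
FRAME_LEN = 256     # samples per frame, as in the paper
LPC_ORDER = 12      # c_1 ... c_12

def lpc_coefficients(frame, order=LPC_ORDER):
    """LPC coefficients by the autocorrelation method (Levinson-Durbin)."""
    r = np.array([frame[: frame.size - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0] + 1e-10
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[1:i][::-1]
        k = -acc / err
        prev = a.copy()
        a[i] = k
        a[1:i] = prev[1:i] + k * prev[1:i][::-1]
        err *= 1.0 - k * k
    return a[1:]                                   # c_1 ... c_12

def pitch_hz(frame, sr=SR, fmin=60.0, fmax=400.0):
    """Crude autocorrelation pitch estimate (assumed; the paper does not specify its method)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[frame.size - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    return sr / (lo + int(np.argmax(ac[lo:hi])))

def frame_features(frame, prev_lpc=None):
    """Return the 15-dimensional F_t = (P_t, p_t, d_t, c_1t, ..., c_12t).
    `frame` is a 1-D float array of 256 samples."""
    power = float(np.log(frame @ frame + 1e-10))   # speech power P_t (log energy)
    pitch = pitch_hz(frame)                        # pitch p_t
    lpc = lpc_coefficients(frame)                  # c_1t ... c_12t
    # delta LPC d_t: spectral change w.r.t. the previous frame (assumed definition)
    delta = float(np.linalg.norm(lpc - prev_lpc)) if prev_lpc is not None else 0.0
    return np.concatenate(([power, pitch, delta], lpc)), lpc
```

Scanning an utterance frame by frame with frame_features yields the sequence F_1, F_2, ... used in the next stage.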
(2) Speech period extraction and speech feature extraction. First, the period in which speech is present is extracted based on the speech power. The speech power is compared with a predetermined threshold value (PTH); if the input power exceeds this threshold for a few consecutive frames, speech is judged to have started. After the beginning of the speech period, the input power is again compared with PTH; if the power stays below PTH for another few consecutive frames, speech is judged to have ended. The speech period is extracted from the whole input through this process.

Twenty frames are then sampled from the extracted speech period, spaced at equal intervals over the whole period. Let these 20 frames be f_1, f_2, ..., f_{20}. The feature parameters of these 20 frames are collected, and the output speech features are determined as a 300-dimensional (15 × 20) feature vector. This feature vector is expressed as

FV = (F_1, F_2, ..., F_{20}),

where F_i is the vector of the 15 feature parameters corresponding to frame f_i. This feature vector (FV) is then used as the input to the emotion recognition stage.
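As an illustration of the period extraction and frame sampling just described, here is a small sketch. The threshold PTH and the length of the "few consecutive frames" run are placeholders, since their actual values are not given in the excerpt; the function consumes the per-frame vectors produced by the sketch in the previous subsection.

```python
import numpy as np

N_FRAMES, FEATS_PER_FRAME = 20, 15
PTH = -4.0        # power threshold (placeholder value)
MIN_RUN = 3       # "a few consecutive frames" (placeholder value)

def speech_period(powers, pth=PTH, min_run=MIN_RUN):
    """Return (start, end) frame indices of the detected speech period."""
    start, end, run = None, None, 0
    for i, p in enumerate(powers):
        if start is None:                          # waiting for speech to start
            run = run + 1 if p > pth else 0
            if run == min_run:
                start, run = i - min_run + 1, 0
        else:                                      # waiting for speech to end
            run = run + 1 if p <= pth else 0
            if run == min_run:
                end = i - min_run + 1
                break
    return start, (end if end is not None else len(powers))

def utterance_feature_vector(frame_feats):
    """Sample 20 equally spaced frames from the speech period and concatenate
    their 15 parameters into the 300-dimensional vector FV."""
    powers = np.array([f[0] for f in frame_feats])   # P_t is the first entry of F_t
    start, end = speech_period(powers)
    if start is None:
        raise ValueError("no speech period detected")
    idx = np.linspace(start, end - 1, N_FRAMES).round().astype(int)
    fv = np.concatenate([frame_feats[i] for i in idx])
    assert fv.shape == (N_FRAMES * FEATS_PER_FRAME,)  # 300 dimensions
    return fv
```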
Fig. 2. Emotion recognition part configuration.

2.3. Emotion recognition

Recognizing emotions is a difficult task. The main reason is that people rely mainly on meaning in daily communication, especially in business communication; this is why speech recognition research has long treated the emotions contained in speech simply as fluctuations or noise. What makes the situation more complicated is that emotional expressions are consciously or unconsciously intertwined with the meaning of speech. In the unconscious case, context plays a more important role than emotional features, and as a result the intensity of emotional expression varies dramatically depending on the situation. Our final target is, of course, to recognize emotions in speech even when emotional expression is unconsciously mixed with the meaning of speech. For the reasons above, however, this is not yet our research target. Instead, the strategy adopted here is to treat speech intentionally uttered with specific emotional expressions rather than speech with unconscious emotional expressions.

Several recognition algorithms are available, such as neural networks and hidden Markov models (HMMs) [7]. HMMs are suitable when the structure of the recognition target is clear to some extent. Since the structure of an emotional feature is not clear, a neural network approach seems more suitable, and we have adopted it here.
(1) Configuration of the neural network. The configuration of the neural network for emotion recognition is shown in Fig. 2. The network is a combination of eight sub-networks, each tuned to recognize one of the eight emotions (anger, sadness, happiness, fear, surprise, disgust, playfulness or neutrality). The construction of each sub-network is shown in Fig. 3; all of the sub-networks have basically the same architecture. Each is a four-layered neural network with 300 input nodes, corresponding to the dimension of the speech features, and one output node; the number of intermediate nodes varies depending on the specific emotion. We adopted this architecture in consideration of the difficulty of recognizing specific emotions: it is easier to prepare a dedicated neural network for each emotion and to tune each network to the characteristics of the emotion it must recognize. Although negative emotions such as anger or sadness are rather easy to recognize, positive emotions such as happiness can be difficult. Thus, the detailed architecture of the networks, such as the number of intermediate nodes, differs from emotion to emotion.
Fig. 3. Sub-network configuration.

(2) Emotion recognition by a neural network. In the emotion recognition phase, the speech feature parameters extracted in the speech processing part are fed simultaneously into the eight sub-networks trained as described above. Eight values, V = (v_1, v_2, ..., v_8), are obtained as the result of the emotion recognition.
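A minimal sketch of this eight-sub-network arrangement, written with PyTorch, is shown below. The four-layer shape (300 inputs, two hidden layers, one output) follows the text; the sigmoid activations, the particular hidden-layer sizes and the rule of reporting the emotion with the largest output are assumptions made for illustration.

```python
import torch
import torch.nn as nn

EMOTIONS = ["anger", "sadness", "happiness", "fear",
            "surprise", "disgust", "playfulness", "neutrality"]

def make_subnet(n_hidden1=64, n_hidden2=16):
    # Four-layer sub-network: 300 inputs, two hidden layers, one output in [0, 1].
    # Hidden sizes differ per emotion in the paper; the defaults here are placeholders.
    return nn.Sequential(
        nn.Linear(300, n_hidden1), nn.Sigmoid(),
        nn.Linear(n_hidden1, n_hidden2), nn.Sigmoid(),
        nn.Linear(n_hidden2, 1), nn.Sigmoid(),
    )

subnets = {e: make_subnet() for e in EMOTIONS}   # one sub-network per emotion

def recognize(fv):
    """Feed one 300-dimensional feature vector FV to all eight sub-networks
    and return V = (v_1, ..., v_8) plus the most strongly activated emotion."""
    x = torch.as_tensor(fv, dtype=torch.float32).reshape(1, -1)
    with torch.no_grad():
        v = {e: float(net(x)) for e, net in subnets.items()}
    return v, max(v, key=v.get)
```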
2.4. Emotion recognition experiment

(1) Speech database collection. It is necessary to train each of the sub-networks for the recognition of emotions. The most important and most difficult issue for neural network training is how to collect a large amount of speech data containing emotions. As our target is content-independent emotion recognition, we adopted 100 phoneme-balanced words as the training word set; examples are "school", "hospital" and "standard". Since we utter most of these words without any special emotion in daily life, it is difficult for ordinary people to utter them with intentional emotions. Therefore, we adopted the following strategy.
(a) First, we ask a radio actor to utter the 100 words with each of the eight emotions. As a professional, he is used to speaking various kinds of words, phrases and sentences with intentional emotions.
(b) Then, we ask speakers to listen to each of these utterances and mimic its tone, and we record the utterances spoken by these ordinary people.

The problem with this strategy is that the spoken emotions are not natural but "forced" emotions. However, based on the consideration described in Section 2.3, forced or intentional emotions are exactly what we study in this research.
Since our target is speaker-independent and content-independent emotion recognition, the following utterances were prepared for the training process:
words: 100 phoneme-balanced words;
speakers: 50 male speakers and 50 female speakers;
emotions: neutrality, anger, sadness, happiness, fear, surprise, disgust and playfulness;
utterances: each speaker uttered the 100 words eight times, using a different emotional expression in each of the eight trials, so that a total of 800 utterances per speaker was obtained as training data.
(2) Training and recognition experiment. We used 30 of the 50 speakers for training. To learn the effect of the number of speakers used for training, we carried out five types of neural net training and obtained the following neural networks:
neural network 1: 10 networks, each trained on a single speaker (#1, #2, ..., #10);
neural network 2: five networks, each trained on two speakers (#1 and #2, #3 and #4, ..., #9 and #10);
neural network 3: two networks, each trained on five speakers (#1-#5 and #6-#10);
neural network 4: one network trained on 10 speakers (#1-#10);
neural network 5: one network trained on 30 speakers (#1-#30).
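The following sketch shows how one sub-network could be trained on such a database, treating it as a one-against-rest detector for its emotion (target 1 for utterances of that emotion, 0 otherwise). The one-against-rest targets, loss, optimiser and epoch count are assumptions; the text states only that each sub-network is trained for its own emotion.

```python
import torch
import torch.nn as nn

def train_subnet(net, loader, epochs=50, lr=1e-3):
    """Train one emotion sub-network; `loader` yields (fv, target) batches with
    fv of shape (batch, 300) and target of shape (batch, 1) in {0.0, 1.0}."""
    optimiser = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        for fv, target in loader:
            optimiser.zero_grad()
            loss = loss_fn(net(fv), target)
            loss.backward()
            optimiser.step()
    return net

# e.g. anger_net = train_subnet(make_subnet(), anger_loader)   # anger_loader is hypothetical
```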
In addition, we carried out two types of recognition experiment to evaluate the performance of the obtained neural networks.
Open recognition experiment: utterances spoken by speakers not included in the training sets are used; twenty speakers (#31-#50) were used for this experiment.
Closed recognition experiment: utterances spoken by the speakers included in the training sets are used.
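The difference between the two experiments lies only in which speakers supply the test utterances, as the following sketch indicates; the data layout (a dict mapping speaker id to labelled feature vectors) is hypothetical.

```python
TRAIN_SPEAKERS = range(1, 31)       # #1-#30, used for training
OPEN_TEST_SPEAKERS = range(31, 51)  # #31-#50, never seen during training

def recognition_rate(data, speakers):
    """data: {speaker_id: [(feature_vector, true_emotion), ...]}"""
    correct = total = 0
    for s in speakers:
        for fv, true_emotion in data[s]:
            _, predicted = recognize(fv)          # from the sketch in Section 2.3
            correct += int(predicted == true_emotion)
            total += 1
    return correct / total

# closed test: recognition_rate(data, TRAIN_SPEAKERS)
# open test:   recognition_rate(data, OPEN_TEST_SPEAKERS)
```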
The recognition results obtained for both closed and open recognition are shown in Fig. 4 for male speakers and Fig. 5 for female speakers. These results show the following facts.
Fig. 4. Emotion recognition results for male data.

(a) For closed recognition experiments, the recognition rate decreases as the number of training speakers increases, approaching 50-55%.
(b) For open recognition experiments, the recognition rate increases as the number of training speakers increases, also approaching 50-55%.
These two trends indicate that, with enough training speakers, an emotion recognition rate of 50-55% can be obtained in the speaker-independent mode. Furthermore, even with 30 training speakers we obtain approximately 50%, which is satisfactory compared with the recognition performance expected for an adequate number of training speakers.

From these recognition experiments we conclude that our emotion recognition capability is sufficient for computer agents to communicate with people through this nonverbal channel.
Fig. 5. Emotion recognition results for female data.

3. Creation of computer agents with spontaneous interactive capabilities

3.1. Overview

As one application of emotion recognition technology, we have tried to apply it to the computer agents in the interactive movie system we are studying. Our main reason for studying interactive movies is as follows.
Ever since the Lumiere brothers created cinematography at the end of the 19th century, movies have undergone various advances in technology and content. Today, movies have established themselves as a composite art form covering a wide range from fine arts to mass entertainment. Movies give us the feeling of experiencing various kinds of dramatic events and happenings in their narratives. However, these experiences are passive illusions, and as a result what we can experience, feel and learn from them is limited.
The integration of interaction and narratives is expected to produce a new type of experience, which we call "Interactive Movies": we can not only watch the story but also interact in it. This provides a totally new kind of experience, in which we encounter dramatic events or narratives that cannot be met in daily life as a subject of the event rather than from the perspective of a third person. This gives us a great opportunity to learn various kinds of skills and lessons through dramatic experiences.
One of the key factors of an interactive movie system is the computer character that interacts with the participants who play the main characters. In the first system we developed, the behaviors of the characters are controlled based on the narratives [1]. In evaluating that system, we learned that the spontaneous reactions of the characters are as important as the narrative-based interactions. Therefore, we have tried to integrate spontaneous interaction capabilities into the movie characters by using emotion recognition.
3.2. Spontaneous interaction

In evaluating the first prototype system, we recognized that there are generally two types of interaction for the computer characters: narrative-based interactions and spontaneous interactions. We also recognized that only the narrative-based interaction capabilities were realized in the first system.
Basically, spontaneous interactions occur between the participants and the characters and do not affect

The emotional state of character j is determined as a function of the emotional states of the participants and the character:

E_o(j, T+1) = f_l(E_p(1, T), E_p(2, T), E_o(j, T)).
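The excerpt does not define f_l, so the following is only one plausible instantiation of the update above: the character's emotional state, represented as a distribution over the eight emotion categories, is blended with the emotional states recognized for the two participants. The blending weight alpha is an illustrative assumption.

```python
import numpy as np

def update_character_emotion(e_char, e_p1, e_p2, alpha=0.6):
    """One possible f_l for E_o(j, T+1) = f_l(E_p(1,T), E_p(2,T), E_o(j,T)):
    keep a fraction alpha of the character's own state and absorb the rest
    from the average of the two participants' recognized emotional states."""
    blended = 0.5 * (np.asarray(e_p1, float) + np.asarray(e_p2, float))
    new_state = alpha * np.asarray(e_char, float) + (1.0 - alpha) * blended
    return new_state / new_state.sum()     # keep it a probability distribution
```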