Using Linguistic Annotations in
Statistical Machine Translation of Film Subtitles
Christian Hardmeier
Fondazione Bruno Kessler
Human Language Technologies
Via Sommarive, 18
I-38050 Povo (Trento)
hardmeier@fbk.eu
Martin Volk
Universität Zürich
Institut für Computerlinguistik
Binzmühlestrasse 14
CH-8050 Zürich
volk@cl.uzh.ch
Abstract
Statistical Machine Translation (SMT) has
been successfully employed to support
translation of film subtitles. We explore
the integration of Constraint Grammar
corpus annotations into a Swedish–Danish
subtitle SMT system in the framework of
factored SMT. While the usefulness of the
annotations is limited with large amounts
of parallel data, we show that linguistic an-
notations can increase the gains in transla-
tion quality when monolingual data in the
target language is added to an SMT system
based on a small parallel corpus.
1 Introduction
In countries where foreign-language films and series on television are routinely subtitled rather than dubbed, there is a considerable demand for efficiently produced subtitle translations. Although it may seem, superficially, that subtitles do not lend themselves to automatic processing because of their literary character, it turns out that their typical text structure, characterised by brevity and syntactic simplicity, together with the immense text volumes processed daily by specialised subtitling companies, makes it possible to produce raw translations of film subtitles with statistical methods quite effectively. If these raw translations are subsequently post-edited by skilled staff, production-quality translations can be obtained with considerably less effort than if the subtitles were translated by human translators with no computer assistance.

A successful subtitle Machine Translation system for the language pair Swedish–Danish, which has now entered productive use, was presented by Volk and Harder (2007). The goal of the present study is to explore whether and how the quality of a Statistical Machine Translation (SMT) system for film subtitles can be improved by using linguistic annotations. To this end, a subset of 1 million subtitles of the training corpus used by Volk and Harder was morphologically annotated with the DanGram parser (Bick, 2001). We integrated the annotations into the translation process using the methods of factored Statistical Machine Translation (Koehn and Hoang, 2007) implemented in the widely used Moses software. After describing the corpus data and giving a short overview of the methods used, we present a number of experiments comparing different factored SMT setups. The experiments are then replicated with reduced training corpora which contain only part of the available training data. These series of experiments provide insights into the impact of corpus size on the effectiveness of using linguistic abstractions for SMT.
2 Machine translation of subtitles
As a text genre, subtitles play a curious role in a complex environment of different media and modalities. They depend on the medium of film, which combines a visual channel with an auditory component composed of spoken language and non-linguistic elements such as noise or music. Within this framework, they render the spoken dialogue as written text, blended into the visual channel and displayed while the original soundtrack, which redundantly contains the same information in a form that may or may not be accessible to the viewer, is played back.

In their linguistic form, subtitles should be faithful, both in content and in style, to the film dialogue which they represent. This means in particular that they usually try to convey an impression of orality. On the other hand, they are constrained by the mode of their presentation: short, written captions superimposed on the picture frame.
According to Becquemont (1996), the characteristics of subtitles are governed by the interplay of two conflicting principles: unobtrusiveness (discrétion) and readability (lisibilité). In
order to provide a satisfactory experience to the
viewers, it is paramount that the subtitles help
them quickly understand the meaning of the dia-
logue without distracting them from enjoying the
film. The amount of text that can be displayed at
one time is limited by the area of the screen that
may be covered by subtitles (usually no more than
two lines) and by the minimum time the subtitle
must remain on screen to ensure that it can actually
be read. As a result, the subtitle text must be short-
ened with respect to the full dialogue text in the
actors’ script. The extent of the reduction depends
on the script and on the exact limitations imposed
for a specific subtitling task, but may amount to
as much as 30 % and reach 50 % in extreme cases
(Tomaszkiewicz, 1993, 6).
As a result of this processing and the considerations underlying it, subtitles have a number of properties that make them especially well suited for Statistical Machine Translation. Owing to their presentational constraints, they mainly consist of comparatively short and simple phrases. Current SMT systems, when trained on a sufficient amount of data, have reliable ways of handling word translation and local structure. By contrast, they are still fairly weak at modelling long-range dependencies and reordering. Compared to other text genres, this weakness is less of an issue in the Statistical Machine Translation of subtitles thanks to their brevity and simple structure. Indeed, half of the subtitles in the Swedish part of our parallel training corpus are no more than 11 tokens long, including two tokens to mark the beginning and the end of the segment and counting every punctuation mark as a separate token. A considerable number of subtitles contain only one or two words besides punctuation, often consisting entirely of a few words of affirmation, negation or abuse. These subtitles can easily be translated by an SMT system that has seen similar examples before.
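To make this way of counting concrete, the following minimal sketch (ours; the boundary markers and the example subtitles are assumptions, not taken from the corpus) computes subtitle lengths as described:

from statistics import median

# Sketch (ours) of the length statistic described above: the token count of
# a subtitle plus two boundary tokens per segment. Assumes pre-tokenised
# input in which every punctuation mark is already a separate token.

def subtitle_length(subtitle: str) -> int:
    return len(subtitle.split()) + 2  # +2 for the <s> and </s> markers

subtitles = ["Ja !", "Vad vet du om det ?"]  # invented examples
print(median(subtitle_length(s) for s in subtitles))  # 6.0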
The orientation of the genre towards spoken language also has some disadvantages for Machine Translation systems. It is possible that the language of the subtitles, influenced by characteristics of speech, contains unexpected features such as stutterings, word repetitions or renderings of non-standard pronunciations that confuse the system. Such features are occasionally employed by subtitlers to lend additional colour to the text, but as they are in stark conflict with the ideals of unobtrusiveness and readability, they are not very frequent.

It is worth noting that, unlike rule-based Machine Translation systems, a statistical system does not in general have any difficulty translating ungrammatical or fragmentary input: phrase-based SMT, operating entirely on the level of words and word sequences, does not require the input to be amenable to any particular kind of linguistic analysis such as parsing. Whilst this approach makes it difficult to handle some linguistic challenges such as long-distance dependencies, it has the advantage of making the system more robust to unexpected input, which is especially important for subtitles.

We have only been able to sketch the characteristics of the subtitle text genre in this paper. Díaz-Cintas and Remael (2007) provide a detailed introduction, including the linguistics of subtitling and translation issues, and Pedersen (2007) discusses the peculiarities of subtitling in Scandinavia.
3 Constraint Grammar annotations
To explore the potential of linguistically annotated
data, our complete subtitle corpus, both in Danish
and in Swedish, was linguistically analysed with
the DanGram Constraint Grammar (CG) parser
(Bick, 2001), a system originally developed for
the analysis of Danish for which there is also a
Swedish grammar. Constraint Grammar (Karls-
son, 1990) is a formalism for natural language
parsing. Conceptually, a CG parser first produces
possible analyses for each word by considering its
morphological features and then applies constrain-
ing rules to filter out analyses that do not fit into
the context. Thus, the word forms are gradually
disambiguated, until only one analysis remains;
multiple analyses may be retained if the sentence
is ambiguous.
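As a toy illustration of this filtering process (our own sketch, not the actual DanGram rule formalism), a constraint rule can be pictured as a function that strikes out readings which are impossible in the given context:

# Toy sketch (ours, not the DanGram rule formalism): every word starts out
# with all morphologically possible readings, and context rules remove
# readings that are impossible in context. The rule below is invented.

def apply_rule(sentence, remove_tag, if_previous_tag):
    """Remove a reading tagged `remove_tag` from any word whose left
    neighbour is already unambiguously tagged `if_previous_tag`."""
    for prev, word in zip(sentence, sentence[1:]):
        if prev["readings"] == [if_previous_tag] and len(word["readings"]) > 1:
            word["readings"] = [r for r in word["readings"] if r != remove_tag]

sentence = [
    {"form": "du",  "readings": ["PERS"]},    # unambiguous pronoun
    {"form": "vet", "readings": ["V", "N"]},  # ambiguous: verb or noun
]
apply_rule(sentence, remove_tag="N", if_previous_tag="PERS")
print([w["readings"] for w in sentence])  # [['PERS'], ['V']]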
The annotations produced by the DanGram
parser were output as tags attached to individual
words as in the following example:
$-
Vad [vad] <interr> INDP NEU S NOM @ACC>
vet [veta] <mv> V PR AKT @FS-QUE
du [du] PERS 2S UTR S NOM @<SUBJ
om [om] PRP @<PIV
det [den] <dem> PERS NEU 3S ACC @P<
$?
In addition to the word forms and the accompany-
ing lemmas (in square brackets), the annotations
contained part-of-speech (POS) tags such as INDP
for “independent pronoun” or V for “verb”, a mor-
phological analysis for each word (such as NEUS
NOM for “neuter singular nominative”) and a tag
specifying the syntactic function of the word in
the sentence (such as @ACC> , indicating that the
sentence-initial pronoun is an accusative object of
the following verb). For some words, more fine-grained part-of-speech information was specified in angle brackets, such as <interr> for “interrogative pronoun” or <mv> for “main verb”.
In our experiments, we used word forms, lemmas,
POS tags and morphological analyses. The fine-
grained POS tags and the syntax tags were not
used.
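To feed such annotations into a factored system, each token has to be turned into a bundle of factors. The following sketch (ours; the Moses-style word|lemma|pos|morph notation and the assumed ordering of the plain tags are based only on the example above) extracts the four factors we used:

import re

# Sketch (ours): convert one DanGram-style output line such as
#   "vet [veta] <mv> V PR AKT @FS-QUE"
# into the four factors used in our experiments (word form, lemma, POS,
# morphology), written in Moses' word|lemma|pos|morph factor notation.
# The assumption that the bare POS tag comes first among the plain tags
# is based on the example above.

def to_factors(line: str) -> str:
    form, rest = line.split(None, 1)
    lemma = re.search(r"\[(.*?)\]", rest).group(1)
    # keep only plain tags: drop the [lemma], <fine-grained> and @syntax tags
    tags = [t for t in rest.split() if not t.startswith(("[", "<", "@"))]
    pos, morph = tags[0], "+".join(tags[1:]) or "-"
    return "|".join((form, lemma, pos, morph))

print(to_factors("vet [veta] <mv> V PR AKT @FS-QUE"))  # vet|veta|V|PR+AKT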
4 Factored Statistical Machine Translation

Statistical Machine Translation formalises the translation process by modelling the probabilities of target language (TL) output strings T given a source language (SL) input string S, p(T|S), and conducting a search for the output string T with the highest probability. In the Moses decoder (Koehn et al., 2007), which we used in our experiments, this probability is decomposed into a log-linear combination of a number of feature functions h_i(S, T), which map a pair of a source and a target language element to a score based on different submodels such as translation models or language models. Each feature function is associated with a weight λ_i that specifies its contribution to the overall score:

    \hat{T} = \arg\max_T \log p(T|S) = \arg\max_T \sum_i \lambda_i h_i(S, T)
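To make the decision rule concrete, here is a minimal sketch (ours; both feature functions are toy stand-ins, and a real decoder searches a large hypothesis space rather than scoring a fixed candidate list):

# Sketch (ours) of the log-linear decision rule: each feature function h_i
# maps a (source, candidate) pair to a score, and the decoder returns the
# candidate maximising the weighted sum. Both features below are toy
# stand-ins for the translation and language models.

def h_translation(src: str, tgt: str) -> float:
    # toy translation-model feature: prefer length-preserving candidates
    return -abs(len(src.split()) - len(tgt.split()))

def h_language(src: str, tgt: str) -> float:
    # toy language-model feature: ignores the source, like a real LM
    return -len(tgt.split())

WEIGHTED_FEATURES = [(0.7, h_translation), (0.3, h_language)]  # (lambda_i, h_i)

def best_translation(src: str, candidates: list[str]) -> str:
    """argmax over candidates T of sum_i lambda_i * h_i(S, T)."""
    return max(candidates,
               key=lambda t: sum(w * h(src, t) for w, h in WEIGHTED_FEATURES))

print(best_translation("Vad vet du om det ?",
                       ["Hvad ved du om det ?", "Hvad ?"]))
# Hvad ved du om det ?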
The translation models employed in factored SMT are phrase-based. The phrases included in a translation model are extracted from a word-aligned parallel corpus with the techniques described by Koehn et al. (2003). The associated probabilities are estimated by the relative frequencies of the extracted phrase pairs in the same corpus. For language modelling, we used the SRILM toolkit (Stolcke, 2002); unless otherwise specified, 6-gram language models with modified Kneser-Ney smoothing were used.

The SMT decoder tries to translate the words and phrases of the source language sentence in the order in which they occur in the input. If the target language requires a different word order, reordering is possible at the cost of a score penalty. The translation model has no notion of sequence, so it cannot control reordering. The language model can, but it has no access to the source language text, so it considers word order only from the point of view of TL grammaticality and cannot model systematic differences in word order between two languages. Lexical reordering models (Koehn et al., 2005) address this issue in a more explicit way by modelling the probability of certain changes in word order, such as swapping words, conditioned on the source and target language phrase pair that is being processed.
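Such a model can be pictured as a conditional distribution over orientations; the following sketch (ours; the orientation inventory follows the common monotone/swap/discontinuous scheme, and the observations are invented) shows the basic estimation step:

from collections import Counter, defaultdict

# Sketch (ours) of lexical reordering model estimation: count, for every
# (source phrase, target phrase) pair, which orientation was observed in
# the word-aligned training corpus, and turn the counts into relative
# frequencies. The observations below are invented for illustration.

ORIENTATIONS = ("monotone", "swap", "discontinuous")

counts = defaultdict(Counter)
observations = [  # (src phrase, tgt phrase, orientation) from extraction
    ("mitt emot", "over for", "monotone"),
    ("mitt emot", "over for", "monotone"),
    ("mitt emot", "over for", "swap"),
]
for src, tgt, orientation in observations:
    counts[src, tgt][orientation] += 1

def p_orientation(src: str, tgt: str, orientation: str) -> float:
    """Relative-frequency estimate of p(orientation | src, tgt)."""
    c = counts[src, tgt]
    return c[orientation] / sum(c.values()) if c else 1 / len(ORIENTATIONS)

print(p_orientation("mitt emot", "over for", "monotone"))  # 2/3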
In its basic form, Statistical Machine Translation treats word tokens as atomic and does not permit further decomposition or access to individual features of the words. Factored SMT (Koehn and Hoang, 2007) extends this model by representing words as vectors composed of a number of features and makes it possible to integrate word-level annotations such as those produced by a Constraint Grammar parser into the translation process. The individual components of the feature vectors are called factors. In order to map between different factors on the target language side, the Moses decoder works with generation models, which are implemented as dictionaries and extracted from the target-language side of the training corpus. They can be used, e.g., to generate word forms from lemmas and morphology tags, or to transform word forms into part-of-speech tags, which can then be checked using a language model.
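A generation model in this sense is essentially a scored dictionary; a minimal sketch (ours, with invented corpus lines and morphology tags) follows:

from collections import Counter, defaultdict

# Sketch (ours) of a generation model: a dictionary mapping one target-side
# factor combination (here lemma + morphology) to another (the word form),
# with relative-frequency scores estimated from the target side of the
# training corpus. Both the corpus lines and the morphology tags below are
# invented for illustration.

corpus = [  # (word form, lemma, morphology) triples
    ("kontrakter", "kontrakt", "P IDF"),
    ("kontrakter", "kontrakt", "P IDF"),
    ("kontrakt",   "kontrakt", "S IDF"),
]

generation = defaultdict(Counter)
for form, lemma, morph in corpus:
    generation[lemma, morph][form] += 1

def generate(lemma: str, morph: str) -> list[tuple[str, float]]:
    """Candidate word forms with p(form | lemma, morph)."""
    c = generation[lemma, morph]
    total = sum(c.values())
    return [(form, n / total) for form, n in c.most_common()]

print(generate("kontrakt", "P IDF"))  # [('kontrakter', 1.0)]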
5 Experiments with the full corpus
We ran three series of experiments to study the
effects of different SMT system setups on trans-
lation quality with three different configurations
of training corpus sizes. For each condition, sev-
eral Statistical Machine Translation systems were
trained and evaluated.
In the full data condition, the complete system
was trained on a parallel corpus of some 900,000
subtitles with source language Swedish and target
language Danish, corresponding to around 10 mil-
lion tokens in each language. The feature weights
were optimised using minimum error rate train-
ing (Och, 2003) on a development set of 1,000
subtitles that had not been used for training, then
the system was evaluated on a 10,000 subtitle test
set that had been held out during the whole de-
velopment phase. The translations were evalu-
ated with the widely used BLEU and NIST scores
(Papineni et al., 2002; Doddington, 2002). The
outcomes of different experiments were compared
with a randomisation-based hypothesis test (Co-
hen, 1995, 165–177). The test was two-sided, and
the confidence level was fixed at 95 %.
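The test can be sketched as follows (our own illustration of a paired randomisation test; `metric` stands in for a corpus-level score such as BLEU and is passed in as a callable):

import random

# Sketch (ours) of a two-sided paired approximate randomisation test: under
# the null hypothesis the two systems are interchangeable, so randomly
# swapping their outputs sentence by sentence should produce score
# differences at least as large as the observed one reasonably often.

def randomisation_test(out_a, out_b, refs, metric, trials=10_000, seed=0):
    rng = random.Random(seed)
    observed = abs(metric(out_a, refs) - metric(out_b, refs))
    hits = 0
    for _ in range(trials):
        swapped_a, swapped_b = [], []
        for a, b in zip(out_a, out_b):  # swap each output pair with p = 0.5
            if rng.random() < 0.5:
                a, b = b, a
            swapped_a.append(a)
            swapped_b.append(b)
        diff = abs(metric(swapped_a, refs) - metric(swapped_b, refs))
        if diff >= observed:
            hits += 1
    return hits / trials  # two-sided p-value; significant if below 0.05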
The results of the experiments can be found in
table 1. The baseline system used only a transla-
tion model operating on word forms and a 6-gram
language model on word forms. This is a stan-
dard setup for an unfactored SMT system. Two
systems additionally included a 6-gram language
model operating on part-of-speech tags and a 5-
gram language model operating on morphology
tags, respectively. The annotation factors required
by these language models were produced from the
word forms by suitable generation models.
In the full data condition, both the part-
of-speech and the morphology language model
brought a slight, but statistically significant gain
in terms of BLEU scores, which indicates that
abstract information about grammar can in some
cases help the SMT system choose the right words.
The improvement is small; indeed, it is not re-
flected in the NIST scores, but some beneficial ef-
fects of the additional language models can be ob-
served in the individual output sentences.
One thing that can be achieved by taking word
class information into account is the disambigua-
tion of ambiguous word forms. Consider the fol-
lowing example:
Input: Ingen vill bo mitt emot en ismaskin.
Reference: Ingen vil bo lige over for en ismaskine.
Baseline: Ingen vil bo mit imod en ismaskin.
POS/Morphology: Ingen vil bo over for en ismaskin.

Since the word ismaskin ‘ice machine’ does not occur in the Swedish part of the training corpus, none of the SMT systems was able to translate it. All of them copied the Swedish input word literally to the output, which is a mistake that cannot be fixed by a language model. However, there is a clear difference in the translation of the phrase mitt emot ‘opposite’. For some reason, the baseline system chose to translate the two words separately and mistakenly interpreted the adverb mitt, which is part of the Swedish expression, as the homonymous first person neuter possessive pronoun ‘my’, translating the Swedish phrase as the ungrammatical Danish mit imod ‘my against’. Both of the additional language models helped to rule out this error and correctly translate mitt emot as over for, yielding a much better translation. Neither of them output the adverb lige ‘just’ found in the reference translation, for which there is no explicit equivalent in the input sentence.
In the next example, the POS and the morphology language model produced different output:

Input: Dåliga kontrakt, dålig ledning, dåliga agenter.
Reference: Dårlige kontrakter, dårlig styring, dårlige agenter.
Baseline: Dårlige kontrakt, dårlig forbindelse, dårlige agenter.
POS: Dårlige kontrakt, dårlig ledelse, dårlige agenter.
Morphology: Dårlige kontrakter, dårlig forbindelse, dårlige agenter.
In Swedish, the indefinite singular and plural forms of the word kontrakt ‘contract(s)’ are homonymous. The two SMT systems without support for morphological analysis incorrectly produced the singular form of the noun in Danish. The morphology language model recognised that the plural adjective dårlige ‘bad’ is more likely to be followed by a plural noun and preferred the correct Danish plural form kontrakter ‘contracts’. The different translations of the word ledning as ‘management’ or ‘connection’ can be traced to a subtle influence of the generation model probability estimates. They illustrate how sensitive the system output is in cases of true ambiguity. None of the systems presented here is capable of reliably choosing the right word based on the context in this case.
In three experiments, the baseline configuration
was extended by adding lexical reordering mod-
els conditioned on word forms, lemmas and part-
of-speech tags, respectively. As in the language
model experiments, the required annotation fac-
tors on the TL side were produced by generation
models.
The lexical reordering models turn out to be
useful in the full data experiments only when con-
ditioned on word forms. When conditioned on
lemmas, the score is not significantly different
from the baseline score, and when conditioned on
part-of-speech tags, it is significantly lower. In this
case, the most valuable information for lexical re-
ordering lies in the word form itself. Lemma and
part of speech are obviously not the right abstrac-
tions to model the reordering processes when suf-
ficient data is available.
Table 1: Experimental results

                            full data         symmetric         asymmetric
                            BLEU      NIST    BLEU      NIST    BLEU      NIST
Baseline                    53.67 %   8.18    42.12 %   6.83    44.85 %   7.10
Language models
  parts of speech         * 53.90 %   8.17  * 42.59 %   6.87    44.71 %   7.08
  morphology              * 54.07 %   8.18  * 42.86 %   6.92  * 44.95 %   7.09
Lexical reordering
  word forms              * 53.99 %   8.21    42.13 %   6.83    44.72 %   7.05
  lemmas                    53.59 %   8.15  * 42.30 %   6.86    44.71 %   7.06
  parts of speech         † 53.36 %   8.13  * 42.33 %   6.86    44.63 %   7.05
Analytical translation      53.73 %   8.18  * 42.28 %   6.90  * 46.73 %   7.34

* BLEU score significantly above baseline (p < .05)
† BLEU score significantly below baseline (p < .05)
Another system, which we call the analytical
translation system, was modelled on suggestions
by Koehn and Hoang (2007) and Bojar (2007). It
used the lemmas and the output of the morphological analysis to decompose the translation process, handling the transfer of lexical and grammatical information with separate components. In order
to achieve this, the baseline system was extended
with additional translation tables mapping SL lem-
mas to TL lemmas and SL morphology tags to TL
morphology tags, respectively. In the target lan-
guage, a generation model was used to transform
lemmas and morphology tags into word forms.
The results reported by Koehn and Hoang (2007)
strongly indicate that this translation approach is
not sufficient on its own; instead, the decomposed
translation approach should be combined with a
standard word form translation model so that one
can be used in those cases where the other fails.
This configuration was therefore adopted for our
experiments.
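In decoding terms, the combination amounts to two alternative translation paths with a back-off, as the following sketch illustrates (ours; all dictionaries and morphology tags are invented stand-ins for the tables learned from the corpus):

# Sketch (ours) of the combined analytical setup as two decoding paths with
# a back-off: try the direct word-form table first; for gaps, translate the
# lemma and the morphology tag separately and generate the target word form.
# All dictionaries and morphology tags are invented stand-ins.

word_table = {"har": "har", "visat": "vist", "mig": "mig"}
lemma_table = {"bröllopsfoto": "bryllupsbillede"}
morph_table = {"NEU P DEF": "NEU P DEF"}
generation_table = {("bryllupsbillede", "NEU P DEF"): "bryllupsbillederne"}

def translate_token(form: str, lemma: str, morph: str) -> str:
    if form in word_table:                 # path 1: direct word-form lookup
        return word_table[form]
    tl_lemma = lemma_table.get(lemma)      # path 2: analytical decomposition
    tl_morph = morph_table.get(morph)
    if tl_lemma and tl_morph:
        return generation_table.get((tl_lemma, tl_morph), tl_lemma)
    return form                            # last resort: copy the input word

print(translate_token("bröllopsfotona", "bröllopsfoto", "NEU P DEF"))
# bryllupsbillederne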
The analytical translation approach fails to
achieve any significant score improvement with
the full parallel corpus. Closer examination of
the MT output reveals that the strategy of using
lemmas and morphological information to trans-
late unknown word forms works in principle, as
shown by the following example:
Input: Molly har visat mig bröllopsfotona.
Reference: Molly har vist mig fotoene fra brylluppet.
Baseline: Molly har vist mig bröllopsfotona.
Analytical: Molly har vist mig bryllupsbillederne.
In this sentence, there can be no doubt that the out-
put produced by the analytical system is superior
to that of the baseline system. Where the base-
line system copied the Swedish word bröllopsfotona ‘wedding photos’ literally into the Dan-
ish text, the translation found by the analytical
model, bryllupsbillederne ‘wedding pictures’, is
both semantically and syntactically flawless. Un-
fortunately, the reference translation uses different
words, so the evaluation scores will not reflect this
improvement.
The lack of success of analytical translation in
terms of evaluation scores can be ascribed to at
least three factors: Firstly, there are relatively few vocabulary gaps in our data, owing to the size of the training corpus. Only 1.19 % (1,311 of
109,823) of the input tokens are tagged as un-
known by the decoder in the baseline system. As
a result, there is not much room for improvement
with an approach specifically designed to handle
vocabulary coverage, especially if this approach
itself fails in some of the cases missed by the base-
line system: Analytical translation brings this fig-
ure down to 0.88 % (970 tokens), but no further.
Secondly, employing generation tables trained on
the same corpus as the translation tables used by
the system limits the attainable gains from the out-
set, since a required word form that is not found in
the translation table is likely to be missing from
the generation table, too. Thirdly, in case of vo-
cabulary gaps in the translation tables, chances
are that the system will not be able to produce
the optimal translation for the input sentence. In-
stead, an approach like analytical translation aims