Using Linguistic Annotations in
Statistical Machine Translation of Film Subtitles
Christian Hardmeier
Fondazione Bruno Kessler
Human Language Technologies
Via Sommarive, 18
I-38050 Povo (Trento)
hardmeier@fbk.eu
Martin Volk
Universität Zürich
Institut für Computerlinguistik
Binzmühlestrasse 14
CH-8050 Zürich
volk@cl.uzh.ch
Abstract
Statistical Machine Translation (SMT) has
been successfully employed to support
translation of film subtitles. We explore
the integration of Constraint Grammar
corpus annotations into a Swedish–Danish
subtitle SMT system in the framework of
factored SMT. While the usefulness of the
annotations is limited with large amounts
of parallel data, we show that linguistic an-
notations can increase the gains in transla-
tion quality when monolingual data in the
target language is added to an SMT system
based on a small parallel corpus.
1 Introduction
In countries where foreign-language films and series on television are routinely subtitled rather than dubbed, there is a considerable demand for efficiently produced subtitle translations. Although it may seem, superficially, that subtitles do not lend themselves to automatic processing because of their literary character, it turns out that their typical text structure, characterised by brevity and syntactic simplicity, together with the immense text volumes processed daily by specialised subtitling companies, makes it possible to produce raw translations of film subtitles with statistical methods quite effectively. If these raw translations are subsequently post-edited by skilled staff, production-quality translations can be obtained with considerably less effort than if the subtitles were translated by human translators with no computer assistance.

A successful subtitle Machine Translation system for the language pair Swedish–Danish, which has now entered productive use, was presented by Volk and Harder (2007). The goal of the present study is to explore whether and how the quality of a Statistical Machine Translation (SMT) system for film subtitles can be improved by using linguistic annotations. To this end, a subset of 1 million subtitles of the training corpus used by Volk and Harder was morphologically annotated with the DanGram parser (Bick, 2001). We integrated the annotations into the translation process using the methods of factored Statistical Machine Translation (Koehn and Hoang, 2007) implemented in the widely used Moses software. After describing the corpus data and giving a short overview of the methods used, we present a number of experiments comparing different factored SMT setups. The experiments are then replicated with reduced training corpora which contain only part of the available training data. These series of experiments provide insights into the impact of corpus size on the effectiveness of using linguistic abstractions for SMT.
2 Machine translation of subtitles
As a text genre, subtitles play a curious role in a complex environment of different media and modalities. They depend on the medium of film, which combines a visual channel with an auditory component composed of spoken language and non-linguistic elements such as noise or music. Within this framework, they render the spoken dialogue as written text, blended into the visual channel and displayed while the original soundtrack, which redundantly contains the same information in a form that may or may not be accessible to the viewer, is played back.

In their linguistic form, subtitles should be faithful, both in content and in style, to the film dialogue which they represent. This means in particular that they usually try to convey an impression of orality. On the other hand, they are constrained by the mode of their presentation: short, written captions superimposed on the picture frame.
According to Becquemont (1996), the characteristics of subtitles are governed by the interplay of two conflicting principles: unobtrusiveness (discrétion) and readability (lisibilité). In
order to provide a satisfactory experience to the
viewers, it is paramount that the subtitles help
them quickly understand the meaning of the dia-
logue without distracting them from enjoying the
film. The amount of text that can be displayed at
one time is limited by the area of the screen that
may be covered by subtitles (usually no more than
two lines) and by the minimum time the subtitle
must remain on screen to ensure that it can actually
be read. As a result, the subtitle text must be short-
ened with respect to the full dialogue text in the
actors’ script. The extent of the reduction depends
on the script and on the exact limitations imposed
for a specific subtitling task, but may amount to
as much as 30 % and reach 50 % in extreme cases
(Tomaszkiewicz, 1993, 6).
As a result of this processing and the considerations underlying it, subtitles have a number of properties that make them especially well suited for Statistical Machine Translation. Owing to their presentational constraints, they mainly consist of comparatively short and simple phrases. Current SMT systems, when trained on a sufficient amount of data, have reliable ways of handling word translation and local structure. By contrast, they are still fairly weak at modelling long-range dependencies and reordering. Compared to other text genres, this weakness is less of an issue in the Statistical Machine Translation of subtitles thanks to their brevity and simple structure. Indeed, half of the subtitles in the Swedish part of our parallel training corpus are no more than 11 tokens long, including two tokens to mark the beginning and the end of the segment and counting every punctuation mark as a separate token. A considerable number of subtitles contain only one or two words besides punctuation, often consisting entirely of a few words of affirmation, negation or abuse. These subtitles can easily be translated by an SMT system that has seen similar examples before.
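To make this way of counting concrete, the following minimal sketch (ours; the boundary markers and the example subtitles are assumptions, not taken from the corpus) computes subtitle lengths as described:

from statistics import median

# Sketch (ours) of the length statistic described above: the token count of
# a subtitle plus two boundary tokens per segment. Assumes pre-tokenised
# input in which every punctuation mark is already a separate token.

def subtitle_length(subtitle: str) -> int:
    return len(subtitle.split()) + 2  # +2 for the <s> and </s> markers

subtitles = ["Ja !", "Vad vet du om det ?"]  # invented examples
print(median(subtitle_length(s) for s in subtitles))  # 6.0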
The orientation of the genre towards spoken language also has some disadvantages for Machine Translation systems. It is possible that the language of the subtitles, influenced by characteristics of speech, contains unexpected features such as stutterings, word repetitions or renderings of non-standard pronunciations that confuse the system. Such features are occasionally employed by subtitlers to lend additional colour to the text, but as they are in stark conflict with the ideals of unobtrusiveness and readability, they are not very frequent.

It is worth noting that, unlike rule-based Machine Translation systems, a statistical system does not in general have any difficulty translating ungrammatical or fragmentary input: phrase-based SMT, operating entirely on the level of words and word sequences, does not require the input to be amenable to any particular kind of linguistic analysis such as parsing. Whilst this approach makes it difficult to handle some linguistic challenges such as long-distance dependencies, it has the advantage of making the system more robust to unexpected input, which is especially important for subtitles.

We have only been able to sketch the characteristics of the subtitle text genre in this paper. Díaz-Cintas and Remael (2007) provide a detailed introduction, including the linguistics of subtitling and translation issues, and Pedersen (2007) discusses the peculiarities of subtitling in Scandinavia.
3 Constraint Grammar annotations
To explore the potential of linguistically annotated
data, our complete subtitle corpus, both in Danish
and in Swedish, was linguistically analysed with
the DanGram Constraint Grammar (CG) parser
(Bick, 2001), a system originally developed for
the analysis of Danish for which there is also a
Swedish grammar. Constraint Grammar (Karls-
son, 1990) is a formalism for natural language
parsing. Conceptually, a CG parser first produces
possible analyses for each word by considering its
morphological features and then applies constrain-
ing rules to filter out analyses that do not fit into
the context. Thus, the word forms are gradually
disambiguated, until only one analysis remains;
multiple analyses may be retained if the sentence
is ambiguous.
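As a toy illustration of this filtering process (our own sketch, not the actual DanGram rule formalism), a constraint rule can be pictured as a function that strikes out readings which are impossible in the given context:

# Toy sketch (ours, not the DanGram rule formalism): every word starts out
# with all morphologically possible readings, and context rules remove
# readings that are impossible in context. The rule below is invented.

def apply_rule(sentence, remove_tag, if_previous_tag):
    """Remove a reading tagged `remove_tag` from any word whose left
    neighbour is already unambiguously tagged `if_previous_tag`."""
    for prev, word in zip(sentence, sentence[1:]):
        if prev["readings"] == [if_previous_tag] and len(word["readings"]) > 1:
            word["readings"] = [r for r in word["readings"] if r != remove_tag]

sentence = [
    {"form": "du",  "readings": ["PERS"]},    # unambiguous pronoun
    {"form": "vet", "readings": ["V", "N"]},  # ambiguous: verb or noun
]
apply_rule(sentence, remove_tag="N", if_previous_tag="PERS")
print([w["readings"] for w in sentence])  # [['PERS'], ['V']]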
The annotations produced by the DanGram
parser were output as tags attached to individual
words as in the following example:
$-
Vad [vad] <interr> INDP NEU S NOM @ACC>
vet [veta] <mv> V PR AKT @FS-QUE
du [du] PERS 2S UTR S NOM @<SUBJ
om [om] PRP @<PIV
det [den] <dem> PERS NEU 3S ACC @P<
$?
In addition to the word forms and the accompany-
ing lemmas (in square brackets), the annotations
contained part-of-speech (POS) tags such as INDP
for “independent pronoun” or V for “verb”, a mor-
phological analysis for each word (such as NEUS
NOM for “neuter singular nominative”) and a tag
specifying the syntactic function of the word in
the sentence (such as @ACC> , indicating that the
sentence-initial pronoun is an accusative object of
the following verb). For some words, more fine-grained part-of-speech information was specified in angle brackets, such as <interr> for “interrogative pronoun” or <mv> for “main verb”.
In our experiments, we used word forms, lemmas,
POS tags and morphological analyses. The fine-
grained POS tags and the syntax tags were not
used.
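To feed such annotations into a factored system, each token has to be turned into a bundle of factors. The following sketch (ours; the Moses-style word|lemma|pos|morph notation and the assumed ordering of the plain tags are based only on the example above) extracts the four factors we used:

import re

# Sketch (ours): convert one DanGram-style output line such as
#   "vet [veta] <mv> V PR AKT @FS-QUE"
# into the four factors used in our experiments (word form, lemma, POS,
# morphology), written in Moses' word|lemma|pos|morph factor notation.
# The assumption that the bare POS tag comes first among the plain tags
# is based on the example above.

def to_factors(line: str) -> str:
    form, rest = line.split(None, 1)
    lemma = re.search(r"\[(.*?)\]", rest).group(1)
    # keep only plain tags: drop the [lemma], <fine-grained> and @syntax tags
    tags = [t for t in rest.split() if not t.startswith(("[", "<", "@"))]
    pos, morph = tags[0], "+".join(tags[1:]) or "-"
    return "|".join((form, lemma, pos, morph))

print(to_factors("vet [veta] <mv> V PR AKT @FS-QUE"))  # vet|veta|V|PR+AKT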
4 Factored Statistical Machine Translation

Statistical Machine Translation formalises the translation process by modelling the probabilities of target language (TL) output strings T given a source language (SL) input string S, p(T|S), and conducting a search for the output string T with the highest probability. In the Moses decoder (Koehn et al., 2007), which we used in our experiments, this probability is decomposed into a log-linear combination of a number of feature functions h_i(S, T), which map a pair of a source and a target language element to a score based on different submodels such as translation models or language models. Each feature function is associated with a weight λ_i that specifies its contribution to the overall score:

    \hat{T} = \arg\max_T \log p(T|S) = \arg\max_T \sum_i \lambda_i h_i(S, T)
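To make the decision rule concrete, here is a minimal sketch (ours; both feature functions are toy stand-ins, and a real decoder searches a large hypothesis space rather than scoring a fixed candidate list):

# Sketch (ours) of the log-linear decision rule: each feature function h_i
# maps a (source, candidate) pair to a score, and the decoder returns the
# candidate maximising the weighted sum. Both features below are toy
# stand-ins for the translation and language models.

def h_translation(src: str, tgt: str) -> float:
    # toy translation-model feature: prefer length-preserving candidates
    return -abs(len(src.split()) - len(tgt.split()))

def h_language(src: str, tgt: str) -> float:
    # toy language-model feature: ignores the source, like a real LM
    return -len(tgt.split())

WEIGHTED_FEATURES = [(0.7, h_translation), (0.3, h_language)]  # (lambda_i, h_i)

def best_translation(src: str, candidates: list[str]) -> str:
    """argmax over candidates T of sum_i lambda_i * h_i(S, T)."""
    return max(candidates,
               key=lambda t: sum(w * h(src, t) for w, h in WEIGHTED_FEATURES))

print(best_translation("Vad vet du om det ?",
                       ["Hvad ved du om det ?", "Hvad ?"]))
# Hvad ved du om det ?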
The translation models employed in factored SMT are phrase-based. The phrases included in a translation model are extracted from a word-aligned parallel corpus with the techniques described by Koehn et al. (2003). The associated probabilities are estimated by the relative frequencies of the extracted phrase pairs in the same corpus. For language modelling, we used the SRILM toolkit (Stolcke, 2002); unless otherwise specified, 6-gram language models with modified Kneser-Ney smoothing were used.

The SMT decoder tries to translate the words and phrases of the source language sentence in the order in which they occur in the input. If the target language requires a different word order, reordering is possible at the cost of a score penalty. The translation model has no notion of sequence, so it cannot control reordering. The language model can, but it has no access to the source language text, so it considers word order only from the point of view of TL grammaticality and cannot model systematic differences in word order between two languages. Lexical reordering models (Koehn et al., 2005) address this issue in a more explicit way by modelling the probability of certain changes in word order, such as swapping words, conditioned on the source and target language phrase pair that is being processed.
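Such a model can be pictured as a conditional distribution over orientations; the following sketch (ours; the orientation inventory follows the common monotone/swap/discontinuous scheme, and the observations are invented) shows the basic estimation step:

from collections import Counter, defaultdict

# Sketch (ours) of lexical reordering model estimation: count, for every
# (source phrase, target phrase) pair, which orientation was observed in
# the word-aligned training corpus, and turn the counts into relative
# frequencies. The observations below are invented for illustration.

ORIENTATIONS = ("monotone", "swap", "discontinuous")

counts = defaultdict(Counter)
observations = [  # (src phrase, tgt phrase, orientation) from extraction
    ("mitt emot", "over for", "monotone"),
    ("mitt emot", "over for", "monotone"),
    ("mitt emot", "over for", "swap"),
]
for src, tgt, orientation in observations:
    counts[src, tgt][orientation] += 1

def p_orientation(src: str, tgt: str, orientation: str) -> float:
    """Relative-frequency estimate of p(orientation | src, tgt)."""
    c = counts[src, tgt]
    return c[orientation] / sum(c.values()) if c else 1 / len(ORIENTATIONS)

print(p_orientation("mitt emot", "over for", "monotone"))  # 2/3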
In its basic form, Statistical Machine Translation treats word tokens as atomic and does not permit further decomposition or access to individual features of the words. Factored SMT (Koehn and Hoang, 2007) extends this model by representing words as vectors composed of a number of features and makes it possible to integrate word-level annotations such as those produced by a Constraint Grammar parser into the translation process. The individual components of the feature vectors are called factors. In order to map between different factors on the target language side, the Moses decoder works with generation models, which are implemented as dictionaries and extracted from the target-language side of the training corpus. They can be used, e.g., to generate word forms from lemmas and morphology tags, or to transform word forms into part-of-speech tags, which can then be checked using a language model.
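A generation model in this sense is essentially a scored dictionary; a minimal sketch (ours, with invented corpus lines and morphology tags) follows:

from collections import Counter, defaultdict

# Sketch (ours) of a generation model: a dictionary mapping one target-side
# factor combination (here lemma + morphology) to another (the word form),
# with relative-frequency scores estimated from the target side of the
# training corpus. Both the corpus lines and the morphology tags below are
# invented for illustration.

corpus = [  # (word form, lemma, morphology) triples
    ("kontrakter", "kontrakt", "P IDF"),
    ("kontrakter", "kontrakt", "P IDF"),
    ("kontrakt",   "kontrakt", "S IDF"),
]

generation = defaultdict(Counter)
for form, lemma, morph in corpus:
    generation[lemma, morph][form] += 1

def generate(lemma: str, morph: str) -> list[tuple[str, float]]:
    """Candidate word forms with p(form | lemma, morph)."""
    c = generation[lemma, morph]
    total = sum(c.values())
    return [(form, n / total) for form, n in c.most_common()]

print(generate("kontrakt", "P IDF"))  # [('kontrakter', 1.0)]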
5 Experiments with the full corpus
We ran three series of experiments to study the
effects of different SMT system setups on trans-
lation quality with three different configurations
of training corpus sizes. For each condition, sev-
eral Statistical Machine Translation systems were
trained and evaluated.
In the full data condition, the complete system
was trained on a parallel corpus of some 900,000
subtitles with source language Swedish and target
language Danish, corresponding to around 10 mil-
lion tokens in each language. The feature weights
were optimised using minimum error rate train-
ing (Och, 2003) on a development set of 1,000
subtitles that had not been used for training, then
the system was evaluated on a 10,000 subtitle test
set that had been held out during the whole de-
velopment phase. The translations were evalu-
ated with the widely used BLEU and NIST scores
(Papineni et al., 2002; Doddington, 2002). The
outcomes of different experiments were compared
with a randomisation-based hypothesis test (Co-
hen, 1995, 165–177). The test was two-sided, and
the confidence level was fixed at 95 %.
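The test can be sketched as follows (our own illustration of a paired randomisation test; `metric` stands in for a corpus-level score such as BLEU and is passed in as a callable):

import random

# Sketch (ours) of a two-sided paired approximate randomisation test: under
# the null hypothesis the two systems are interchangeable, so randomly
# swapping their outputs sentence by sentence should produce score
# differences at least as large as the observed one reasonably often.

def randomisation_test(out_a, out_b, refs, metric, trials=10_000, seed=0):
    rng = random.Random(seed)
    observed = abs(metric(out_a, refs) - metric(out_b, refs))
    hits = 0
    for _ in range(trials):
        swapped_a, swapped_b = [], []
        for a, b in zip(out_a, out_b):  # swap each output pair with p = 0.5
            if rng.random() < 0.5:
                a, b = b, a
            swapped_a.append(a)
            swapped_b.append(b)
        diff = abs(metric(swapped_a, refs) - metric(swapped_b, refs))
        if diff >= observed:
            hits += 1
    return hits / trials  # two-sided p-value; significant if below 0.05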
The results of the experiments can be found in
table 1. The baseline system used only a transla-
tion model operating on word forms and a 6-gram
language model on word forms. This is a stan-
dard setup for an unfactored SMT system. Two
systems additionally included a 6-gram language
model operating on part-of-speech tags and a 5-
gram language model operating on morphology
tags, respectively. The annotation factors required
by these language models were produced from the
word forms by suitable generation models.
In the full data condition, both the part-
of-speech and the morphology language model
brought a slight, but statistically significant gain
in terms of BLEU scores, which indicates that
abstract information about grammar can in some
cases help the SMT system choose the right words.
The improvement is small; indeed, it is not re-
flected in the NIST scores, but some beneficial ef-
fects of the additional language models can be ob-
served in the individual output sentences.
One thing that can be achieved by taking word
class information into account is the disambigua-
tion of ambiguous word forms. Consider the fol-
lowing example:
Input: Ingen vill bo mitt emot en ismaskin.
Reference: Ingen vil bo lige over for en ismaskine.
Baseline: Ingen vil bo mit imod en ismaskin.
POS/Morphology: Ingen vil bo over for en ismaskin.

Since the word ismaskin ‘ice machine’ does not occur in the Swedish part of the training corpus, none of the SMT systems was able to translate it. All of them copied the Swedish input word literally to the output, which is a mistake that cannot be fixed by a language model. However, there is a clear difference in the translation of the phrase mitt emot ‘opposite’. For some reason, the baseline system chose to translate the two words separately and mistakenly interpreted the adverb mitt, which is part of the Swedish expression, as the homonymous first person neuter possessive pronoun ‘my’, translating the Swedish phrase as the ungrammatical Danish mit imod ‘my against’. Both of the additional language models helped to rule out this error and correctly translate mitt emot as over for, yielding a much better translation. Neither of them output the adverb lige ‘just’ found in the reference translation, for which there is no explicit equivalent in the input sentence.
In the next example, the POS and the morphology language model produced different output:

Input: Dåliga kontrakt, dålig ledning, dåliga agenter.
Reference: Dårlige kontrakter, dårlig styring, dårlige agenter.
Baseline: Dårlige kontrakt, dårlig forbindelse, dårlige agenter.
POS: Dårlige kontrakt, dårlig ledelse, dårlige agenter.
Morphology: Dårlige kontrakter, dårlig forbindelse, dårlige agenter.
In Swedish, the indefinite singular and plural forms of the word kontrakt ‘contract(s)’ are homonymous. The two SMT systems without support for morphological analysis incorrectly produced the singular form of the noun in Danish. The morphology language model recognised that the plural adjective dårlige ‘bad’ is more likely to be followed by a plural noun and preferred the correct Danish plural form kontrakter ‘contracts’. The different translations of the word ledning as ‘management’ or ‘connection’ can be traced to a subtle influence of the generation model probability estimates. They illustrate how sensitive the system output is in cases of true ambiguity. None of the systems presented here is capable of reliably choosing the right word based on the context in this case.
In three experiments, the baseline configuration
was extended by adding lexical reordering mod-
els conditioned on word forms, lemmas and part-
of-speech tags, respectively. As in the language
model experiments, the required annotation fac-
tors on the TL side were produced by generation
models.
The lexical reordering models turn out to be
useful in the full data experiments only when con-
ditioned on word forms. When conditioned on
lemmas, the score is not significantly different
from the baseline score, and when conditioned on
part-of-speech tags, it is significantly lower. In this
case, the most valuable information for lexical re-
ordering lies in the word form itself. Lemma and
part of speech are obviously not the right abstrac-
tions to model the reordering processes when suf-
ficient data is available.
Table 1: Experimental results

                            full data         symmetric         asymmetric
                            BLEU      NIST    BLEU      NIST    BLEU      NIST
Baseline                    53.67 %   8.18    42.12 %   6.83    44.85 %   7.10
Language models
  parts of speech         * 53.90 %   8.17  * 42.59 %   6.87    44.71 %   7.08
  morphology              * 54.07 %   8.18  * 42.86 %   6.92  * 44.95 %   7.09
Lexical reordering
  word forms              * 53.99 %   8.21    42.13 %   6.83    44.72 %   7.05
  lemmas                    53.59 %   8.15  * 42.30 %   6.86    44.71 %   7.06
  parts of speech         † 53.36 %   8.13  * 42.33 %   6.86    44.63 %   7.05
Analytical translation      53.73 %   8.18  * 42.28 %   6.90  * 46.73 %   7.34

* BLEU score significantly above baseline (p < .05)
† BLEU score significantly below baseline (p < .05)
Another system, which we call the analytical
translation system, was modelled on suggestions
by Koehn and Hoang (2007) and Bojar (2007). It
used the lemmas and the output of the morphological analysis to decompose the translation process, handling the transfer of lexical and grammatical information with separate components. In order
to achieve this, the baseline system was extended
with additional translation tables mapping SL lem-
mas to TL lemmas and SL morphology tags to TL
morphology tags, respectively. In the target lan-
guage, a generation model was used to transform
lemmas and morphology tags into word forms.
The results reported by Koehn and Hoang (2007)
strongly indicate that this translation approach is
not sufficient on its own; instead, the decomposed
translation approach should be combined with a
standard word form translation model so that one
can be used in those cases where the other fails.
This configuration was therefore adopted for our
experiments.
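In decoding terms, the combination amounts to two alternative translation paths with a back-off, as the following sketch illustrates (ours; all dictionaries and morphology tags are invented stand-ins for the tables learned from the corpus):

# Sketch (ours) of the combined analytical setup as two decoding paths with
# a back-off: try the direct word-form table first; for gaps, translate the
# lemma and the morphology tag separately and generate the target word form.
# All dictionaries and morphology tags are invented stand-ins.

word_table = {"har": "har", "visat": "vist", "mig": "mig"}
lemma_table = {"bröllopsfoto": "bryllupsbillede"}
morph_table = {"NEU P DEF": "NEU P DEF"}
generation_table = {("bryllupsbillede", "NEU P DEF"): "bryllupsbillederne"}

def translate_token(form: str, lemma: str, morph: str) -> str:
    if form in word_table:                 # path 1: direct word-form lookup
        return word_table[form]
    tl_lemma = lemma_table.get(lemma)      # path 2: analytical decomposition
    tl_morph = morph_table.get(morph)
    if tl_lemma and tl_morph:
        return generation_table.get((tl_lemma, tl_morph), tl_lemma)
    return form                            # last resort: copy the input word

print(translate_token("bröllopsfotona", "bröllopsfoto", "NEU P DEF"))
# bryllupsbillederne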
The analytical translation approach fails to
achieve any significant score improvement with
the full parallel corpus. Closer examination of
the MT output reveals that the strategy of using
lemmas and morphological information to trans-
late unknown word forms works in principle, as
shown by the following example:
Input: Molly har visat mig bröllopsfotona.
Reference: Molly har vist mig fotoene fra brylluppet.
Baseline: Molly har vist mig bröllopsfotona.
Analytical: Molly har vist mig bryllupsbillederne.
In this sentence, there can be no doubt that the out-
put produced by the analytical system is superior
to that of the baseline system. Where the base-
line system copied the Swedish word bröllopsfotona ‘wedding photos’ literally into the Dan-
ish text, the translation found by the analytical
model, bryllupsbillederne ‘wedding pictures’, is
both semantically and syntactically flawless. Un-
fortunately, the reference translation uses different
words, so the evaluation scores will not reflect this
improvement.
The lack of success of analytical translation in
terms of evaluation scores can be ascribed to at
least three factors: Firstly, there are relatively few vocabulary gaps in our data, owing to the size of the training corpus. Only 1.19 % (1,311 of
109,823) of the input tokens are tagged as un-
known by the decoder in the baseline system. As
a result, there is not much room for improvement
with an approach specifically designed to handle
vocabulary coverage, especially if this approach
itself fails in some of the cases missed by the base-
line system: Analytical translation brings this fig-
ure down to 0.88 % (970 tokens), but no further.
Secondly, employing generation tables trained on
the same corpus as the translation tables used by
the system limits the attainable gains from the out-
set, since a required word form that is not found in
the translation table is likely to be missing from
the generation table, too. Thirdly, in case of vo-
cabulary gaps in the translation tables, chances
are that the system will not be able to produce
the optimal translation for the input sentence. In-
stead, an approach like analytical translation aims