|Computational Linguistics in the Netherlands (CLIN) 2007|
Abstracts: parallel session 4
Memory-based machine translation: Case studies in Dutch-English translation
Memory-based machine translation is a variant of example-based machine translation in which a classifier maps source language words in context to n-grams of target language words to which the source word is aligned. In a generate-and-search process, the predicted n-grams are then arranged in possible target language sequences driven by their mutual overlap.
Two key differences between this system and the popular phrase-based statistical machine translation (PB-SMT) approach are (1) that PB-SMT does not use a discriminative classifier (a filter) to limit the phrases available to the second phase, and (2) that PB-SMT bases its target sequences not on the overlap of phrases but on an external language model.
In this paper we compare the two approaches, demonstrate equivalence of the memory-based approach to stochastic n-gram models, and compare various hybrids of the two. The language pair on which the demonstration focuses is English and Dutch. We train and test our memory-based classifiers on the Europarl and JRC-Acquis parallel corpora.
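The generate-and-search step described above can be illustrated with a minimal sketch (not the authors' implementation): predicted target-language n-grams are merged into one sequence at their points of maximal mutual overlap. All data below are invented.

```python
# Minimal sketch of overlap-driven assembly of predicted target n-grams.
# A greedy strategy; the paper's search over possible sequences is richer.

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def stitch(ngrams):
    """Arrange predicted n-grams into one target sequence, merging each
    next n-gram at its maximal overlap with the sequence so far."""
    seq = list(ngrams[0])
    for ng in ngrams[1:]:
        k = overlap(seq, list(ng))
        seq.extend(ng[k:])
    return seq

predicted = [("the", "house", "is"), ("house", "is", "big")]
print(stitch(predicted))  # ['the', 'house', 'is', 'big']
```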
The -t inflection of singular present-tense verbs whose stem ends in -d is a common source of writing errors in Dutch. In this paper we describe the development of a machine-learning-based disambiguator that determines, on the basis of the local context, whether to inflect the verb with -t. Obtaining training examples is straightforward: they can be drawn from any Dutch text corpus. Optionally, an accurate POS tagger can be used to generate additional features.
We approach the disambiguation from four angles. First, we train individual classifiers on the disambiguation of the forms of a single verb (e.g. word-wordt). Second, we train a monolithic classifier on all d-dt examples. The third classifier is trained to determine whether any verb, regardless of whether it ends in -d, receives a t-inflection in context. Finally, a memory-based language model merely predicts the appropriate word form given the local context. We also develop combinations of these classifiers; for example, we combine the global d-dt disambiguator with a limited number of word-specific disambiguators that are more accurate than the global classifier.
We evaluate the classifiers in three ways. First, we measure their disambiguation accuracy, which in general exceeds 99%. Second, we analyse a random sample of the errors made by the disambiguator. This provides, among other things, a rough estimate of the number of corpus errors that the classifier has found and corrected. Third, to focus on the precision of the error correction, we apply the disambiguator to an external test set derived from the internet.
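Memory-based learning here amounts to k-nearest-neighbour classification over local context features. The following is a toy 1-NN sketch of the idea with an invented two-example memory; the classifiers in the paper are trained on corpus-scale data with more sophisticated similarity metrics.

```python
# Toy sketch of memory-based (nearest-neighbour) d/dt disambiguation.
# Features are the local context words; the training pairs are invented.

def features(words, i, width=2):
    """Context window around position i, padded with '_'."""
    pad = ['_'] * width
    w = pad + words + pad
    i += width
    return tuple(w[i - width:i] + w[i + 1:i + width + 1])

def overlap_sim(a, b):
    """Number of matching feature values (simple overlap metric)."""
    return sum(x == y for x, y in zip(a, b))

def classify(instance, memory):
    """1-NN: return the label of the most similar stored example."""
    return max(memory, key=lambda ex: overlap_sim(instance, ex[0]))[1]

# Tiny illustrative memory: context features -> 'd' or 'dt'
memory = [
    (features("het word gezegd dat".split(), 1), "dt"),   # 'het wordt gezegd'
    (features("ik word morgen gebeld".split(), 1), "d"),  # 'ik word'
]
test = features("het word beweerd dat".split(), 1)
print(classify(test, memory))  # 'dt'
```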
Many youngsters and preadolescents have trouble with spelling, but professional language users, too, regularly have questions about the spelling of certain words and could benefit from an easily accessible tool.
Several ways to check the spelling of words already exist, but Spelspiek adds a dynamic feature: it can be used through modern communication interfaces such as web browsers, MSN or SMS. Moreover, it is possible to ask spelling questions in natural language. Spelspiek uses lexical data provided by INL and Van Dale. The Spelspiek software integrates spelling correction software by Polderland and chatbot software by Elitech.
In this talk, we describe a sub-sentential alignment system that links linguistically motivated phrases in parallel texts. Sub-sentential alignments are used, among other things, to create phrase tables for statistical phrase-based machine translation (SMT) systems. However, a stand-alone sub-sentential alignment module is also useful for human translators if incorporated into CAT tools, e.g. sophisticated bilingual concordance systems, or into sub-sentential translation memory systems.
In existing SMT systems, a phrase is not linguistically motivated: it can be any contiguous sequence of words. We expect that the use of linguistically relevant phrases can improve the performance of phrase-based SMT systems.
We present the first results of our sub-sentential alignment system, which links linguistically motivated chunks based on lexical clues (word alignments) and syntactic similarity measures. In our baseline system, on average 30% of the words can be linked with a precision ranging from 93% to 98%.
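As an illustration, one plausible (and here greatly simplified, with invented data and thresholds) linking criterion connects a source chunk to a target chunk when a sufficient fraction of the source chunk's words is word-aligned into the target chunk; the actual system also uses syntactic similarity measures.

```python
# Hedged sketch: link chunk pairs based on word-alignment overlap.
# Chunks are (start, end) word-index spans, end exclusive.

def link_chunks(src_chunks, tgt_chunks, alignments, threshold=0.5):
    """alignments: set of (src_word_index, tgt_word_index) pairs.
    Link (si, ti) when >= threshold of the source chunk's words are
    aligned to words inside the target chunk."""
    links = []
    for si, (ss, se) in enumerate(src_chunks):
        for ti, (ts, te) in enumerate(tgt_chunks):
            inside = sum(1 for (i, j) in alignments
                         if ss <= i < se and ts <= j < te)
            if inside / (se - ss) >= threshold:
                links.append((si, ti))
    return links

# 'the black cat' / 'de zwarte kat', one NP chunk on each side
src_chunks = [(0, 3)]
tgt_chunks = [(0, 3)]
alignments = {(0, 0), (1, 1), (2, 2)}
print(link_chunks(src_chunks, tgt_chunks, alignments))  # [(0, 0)]
```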
On behalf of the Dutch Royal Library in The Hague, we have studied and partially solved the problem of OCR-induced typographical variation.
Text-Induced Corpus Clean-up or TICCL (pronounced 'tickle') focuses on high-frequency words derived from the corpus to be cleaned and exhaustively gathers all typographical variants of any particular focus word that lie within a predefined Levenshtein distance (henceforth: LD). It then employs effective text-induced filtering techniques to retain as many of the true positives, and discard as many of the false positives, as possible.
TICCL has been evaluated on a contemporary OCR-ed text corpus, the Staten-Generaal Digitaal 1989-1995 (SGD), and on a corpus of historical newspaper articles, 'Het Volk 1918' (HV). The latter presents greater challenges: its OCR quality is far lower and it is in the older Dutch spelling 'De Vries-Te Winkel'. We have annotated representative samples of typographical variants from both corpora, allowing us not only to evaluate our system, but also to draw conclusions about adapting the correction mechanism to OCR-error resolution.
For both corpora, TICCL obtains a cumulative F-score of around 95% at LD 2, with recall at around 99%. Looking no further than 2 edits, these scores mean that almost 89% of the undesirable OCR-induced typographical variation in the SGD, and almost 55% in HV, can be removed fully automatically, as these are the summed percentages of LD 1 and LD 2 errors observed in our 5,047 SGD and 3,799 HV error samples.
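The variant-gathering step can be sketched as follows, under the assumption that candidates within a fixed Levenshtein distance of a high-frequency focus word are collected from the corpus vocabulary; the word list is invented and the filtering stage is omitted.

```python
# Minimal sketch of gathering typographical variants within a Levenshtein
# distance (LD) bound of a focus word.

def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def variants(focus, vocabulary, max_ld=2):
    """All corpus words within max_ld edits of the focus word."""
    return sorted(w for w in vocabulary
                  if w != focus and levenshtein(focus, w) <= max_ld)

vocab = {"regering", "regeering", "rcgering", "minister", "regeringen"}
print(variants("regering", vocab))  # ['rcgering', 'regeering', 'regeringen']
```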
Correctly handling the translation of terminology is a key component of successful machine translation systems. The easy way out that commercial systems usually offer their users is the possibility of adding new terms and their translations to dedicated lexicons.
Recently, Langlais and Patry (2007) have shown that analogical learning offers an elegant way to translate unknown words. In this work, we analyze how suitable the approach is for translating terms.
The first part of the presentation describes the algorithms required for manipulating analogical equations. An analogical equation, noted A:B::C:D, is a relation between four items (e.g. words) which reads as "A is to B as C is to D"; a typical example is reader:unreadable::doer:undoable. In the second part we describe how this concept can be applied to the problem of translating terms. As a case study, we applied the approach to translating medical terms from English into French, and vice versa. We observed that roughly half of the terms can be translated successfully.
One drawback of the approach proposed by Langlais and Patry is the need for an external target language lexicon to filter out spurious analogical equations (a concept we will clarify). We propose to tackle this issue by training a binary classifier and will report on its performance on the task we addressed in this study.
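A very restricted form of solving an analogical equation A:B::C:x on strings can be sketched by factoring B around the longest substring it shares with A and re-applying the affix change to C. Real analogical learning, as in Langlais and Patry (2007), is considerably more general; this toy version only handles a single prefix/suffix pattern.

```python
# Hedged sketch: solve A:B::C:x for one affix-substitution pattern.

def longest_common_substring(a, b):
    """DP search for the longest contiguous substring shared by a and b."""
    best, best_i = 0, 0
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                if table[i][j] > best:
                    best, best_i = table[i][j], i
    return a[best_i - best:best_i]

def solve_analogy(a, b, c):
    """Return x such that a:b::c:x, or None if the pattern does not apply."""
    core = longest_common_substring(a, b)
    if not core:
        return None
    pa, sa = a.split(core, 1)   # a = pa + core + sa
    pb, sb = b.split(core, 1)   # b = pb + core + sb
    if not (c.startswith(pa) and c.endswith(sa)
            and len(c) >= len(pa) + len(sa)):
        return None
    core_c = c[len(pa):len(c) - len(sa)]
    return pb + core_c + sb     # apply the same affix change to c

print(solve_analogy("reader", "unreadable", "doer"))  # undoable
```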
Phrase-based and syntax-based methods in statistical machine translation (MT) have complementary strengths and shortcomings. Phrase-based methods typically neglect discontiguous dependencies, while syntax-based methods neglect low-level lexical dependencies. What is needed is a system that takes into account contiguous as well as discontiguous phrases, regardless of whether they form linguistically motivated constituents. In this paper we start an investigation into using a successful unsupervised parsing system, known as U-DOP, for providing the tree structures for bilingual corpora such as the Europarl corpus. U-DOP induces a probabilistic tree-substitution grammar from raw data and has achieved competitive parsing results. We will use the structures induced by U-DOP for extending the so-called Data-Oriented Translation (DOT) system towards Unsupervised DOT, which we will term U-DOT. The U-DOT model starts by assigning all possible alignments between paired trees bootstrapped by U-DOP and uses a large sample of possible subtree pairs to compute the most probable target sentence given a source sentence. This results in an MT model which takes into account contiguous as well as discontiguous phrases. We test U-DOT on the German-English Europarl corpus, showing that the inclusion of discontiguous phrases significantly improves the translation accuracy. Most of these phrases correspond to discontiguous verb phrases that are highly frequent in Germanic languages but that are not captured by well-known phrase-based MT systems like Pharaoh.
The METIS project, an effort to develop hybrid machine translation methods for language pairs without bilingual corpora, entails a great deal of sorting through example sentences for sufficiently similar examples. Best matches are discovered by comparing chunk patterns in input sentences to those of stored target-language sentences, taking into account phrase types, parts of speech and the lemmas of the headwords of the constituent phrases. The best matches are those which, given the data available, are expected to best inform a correct translation. This class of inexact matching problem is difficult to undertake on a large scale and is ill-suited to many conventional indexing techniques. However, viewed as a constrained form of tree structure matching, the problem fits within the scope of relatively recently developed techniques in semi-structured data mining. This presentation introduces a simplified variant of the TreeMiner and FREQT algorithms that operates efficiently within the limited class of linguistic data structures used for matching in this project, and demonstrates its application to fast search for both exact and inexact best matches for sentence structures within large corpora.
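The inexact matching of chunk patterns can be made concrete with a small sketch: each chunk is reduced to a (phrase type, part of speech, head lemma) triple, and candidates are ranked by the number of matching features. The data and the scoring scheme below are invented for illustration and sidestep the indexing problem the presentation addresses.

```python
# Hedged sketch of inexact best-match search over chunk patterns.
# A pattern is a list of (phrase_type, head_pos, head_lemma) triples;
# patterns of unequal length are compared only over their common prefix.

def chunk_score(a, b):
    """Number of matching features between two chunk triples."""
    return sum(x == y for x, y in zip(a, b))

def best_match(pattern, stored):
    """Return the stored pattern with the highest summed chunk score."""
    def score(cand):
        return sum(chunk_score(a, b) for a, b in zip(pattern, cand))
    return max(stored, key=score)

pattern = [("NP", "N", "cat"), ("VP", "V", "sleep")]
stored = [
    [("NP", "N", "dog"), ("VP", "V", "sleep")],   # 5 of 6 features match
    [("PP", "P", "on"), ("NP", "N", "mat")],      # 0 features match
]
print(best_match(pattern, stored) == stored[0])  # True
```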
The interdisciplinary project on rule-based search in text databases with nonstandard orthography develops mechanisms to ease working with documents that contain spelling variants and recognition errors. Its methods for automatic support steadily reduce the amount of human intervention required. Even though the quality of optical character recognition has greatly increased in recent years, suboptimal sources such as old documents, texts of poor quality or those in historical fonts still prove to be highly problematic. Historical spelling variation is documented for most European languages and even Chinese. The German unification of orthography took place only a little more than a century ago, leaving about six hundred years of ample nonstandardized (Early) High German text production. This paper describes the application of a Naïve Bayes classifier, commonly used for spam filtering, in combination with several other filters to separate spelling variants, recognition errors and standard spellings in electronic documents. The detection of nonstandard spellings is a precondition for reliable automatic error correction. Using the knowledge gathered from our database of historical texts, the proportion of variant spellings in an unknown text allows for a diachronic classification to be deduced. Likewise, the extracted spellings can be used directly in the stochastic learning mechanism of our fuzzy search engine to improve its retrieval.
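The idea of separating standard spellings, historical variants and recognition errors with a Naive Bayes classifier can be sketched as follows; the features (character trigrams), the tiny training set and the class labels are all invented for illustration.

```python
# Toy Naive Bayes over character trigrams, with add-one smoothing.
# Class priors are ignored (the toy classes are balanced).

import math
from collections import Counter, defaultdict

def trigrams(word):
    w = "#" + word + "#"
    return [w[i:i + 3] for i in range(len(w) - 2)]

class TrigramNB:
    def __init__(self):
        self.counts = defaultdict(Counter)
        self.totals = Counter()

    def train(self, word, label):
        for g in trigrams(word):
            self.counts[label][g] += 1
            self.totals[label] += 1

    def classify(self, word):
        vocab = {g for c in self.counts.values() for g in c}
        def loglik(label):
            c, n = self.counts[label], self.totals[label]
            return sum(math.log((c[g] + 1) / (n + len(vocab)))
                       for g in trigrams(word))
        return max(self.counts, key=loglik)

nb = TrigramNB()
for word, label in [("regering", "standard"), ("mensen", "standard"),
                    ("regeering", "variant"), ("menschen", "variant"),
                    ("rcgering", "ocr"), ("men5en", "ocr")]:
    nb.train(word, label)
print(nb.classify("visschen"))  # 'variant'
```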
In the METIS project, we combine techniques from rule-based and corpus-based MT in a hybrid approach. We restrict ourselves to basic resources only, in order to develop an MT methodology for lesser-resourced language pairs.
In the current system, we use basic source language analysis tools, such as a tagger and a chunker. The transition from source to target language is made through a dictionary, tag mapping rules, and a limited set of transfer rules, mainly to map source language tenses onto target language tenses.
One of the problems of the system as described in Vandeghinste, Dirix, and Schuurman (2007) is that it is not always capable of generating one preferred translation. This makes it harder to evaluate the system using automated metrics such as BLEU or NIST.
By adding cooccurrence metrics, we are able to improve the results of our system and to differentiate between several translation candidates. The cooccurrence metrics are based on the average distance between content words, restricted to a limited window size.
An evaluation experiment is set up with conditions both with and without these cooccurrence metrics.
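A simple stand-in for such a metric can be sketched as follows: pairs of content words that occur near each other in a target-language corpus make a translation candidate more plausible. Note that this sketch uses window-restricted pair counts rather than the average-distance measure of the paper, and the corpus and candidates are invented.

```python
# Hedged sketch: rank translation candidates by a cooccurrence score.

from collections import Counter
from itertools import combinations

def cooccurrences(sentences, window=3):
    """Count unordered word pairs occurring within `window` positions."""
    counts = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            for v in sent[i + 1:i + 1 + window]:
                counts[frozenset((w, v))] += 1
    return counts

def score(candidate, counts):
    """Average cooccurrence count over all word pairs in the candidate
    (the candidate is assumed to be content words only)."""
    pairs = list(combinations(candidate, 2))
    if not pairs:
        return 0.0
    return sum(counts[frozenset(p)] for p in pairs) / len(pairs)

corpus = [["strong", "tea", "please"], ["a", "cup", "of", "strong", "tea"]]
counts = cooccurrences(corpus)
cand_a = ["strong", "tea"]     # attested collocation
cand_b = ["powerful", "tea"]   # not attested in the toy corpus
print(score(cand_a, counts) > score(cand_b, counts))  # True
```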