
CLIN 2007

Friday December 7, 2007
University of Nijmegen

CLIN 2007 is organized by the Language and Speech group of the Radboud University Nijmegen.

Abstracts parallel session 2

Dutch Word Sketches
Semi-automatic glossary creation from learning objects
Detecting semantic overlap: Announcing a parallel monolingual treebank for Dutch
Creating a Dead Poets Society: extracting a social network of historical persons from the web
Merging LT4eL with the Cornetto Database
Evaluating a multi-lingual keyphrase extractor in an eLearning context
Exploiting background knowledge during automatic controlled keyword suggestion
Informative Features for Relation Extraction
The Automatic Detection of Scientific Terms in Patient Information
Deriving Ontologies from Wikipedia for Semantic Annotation of Texts


Dutch Word Sketches

Carole Tiberius (Instituut voor Nederlandse Lexicologie)

In this presentation we discuss setting up the Sketch Engine for use within the Algemeen Nederlands Woordenboek (ANW) project.
The Sketch Engine (Kilgarriff et al. 2004) is a corpus tool which takes as input a corpus of any language and corresponding grammar patterns (regular expressions over POS tags) to produce word sketches for the words of that language in addition to offering the standard corpus query functions. It also automatically generates a thesaurus and 'sketch differences', which specify similarities and differences between near-synonyms.
The aim of the ANW project is to create a corpus-based, electronic dictionary of contemporary Dutch in the Netherlands and Flanders covering the period 1970-2018. The dictionary is based on the ANW corpus, a balanced corpus of just over 100 million words including, amongst others, present-day literary texts, domain-specific texts, newspaper texts and a corpus of neologisms. The corpus is lemmatised and POS tagged.
The ANW corpus was loaded into the Sketch Engine software. We shall demonstrate the Dutch word sketches and show how they can be used in the lexicographic work of the ANW dictionary.
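
The abstract does not reproduce any of the sketch grammar patterns, so the fragment below is only an invented, minimal illustration of the underlying idea: a regular expression over POS tags picks out a grammatical relation (here, an adjective directly modifying a noun) from which collocation counts for a word sketch could be accumulated. The tag set and the toy sentence are assumptions made for the example, not ANW data.

    import re
    from collections import Counter

    # Toy POS-tagged sentence; the tag set is invented for this illustration.
    tagged = [("de", "DET"), ("sterke", "ADJ"), ("koffie", "N"),
              ("en", "CONJ"), ("de", "DET"), ("zwarte", "ADJ"), ("koffie", "N")]

    # Encode the tag sequence so that a regular expression over POS tags can be
    # applied, in the spirit of the Sketch Engine's sketch grammars.
    tag_string = " ".join(f"{i}/{tag}" for i, (_, tag) in enumerate(tagged))
    adj_noun = re.compile(r"(\d+)/ADJ (\d+)/N\b")

    collocations = Counter()
    for adj_i, noun_i in adj_noun.findall(tag_string):
        noun, adj = tagged[int(noun_i)][0], tagged[int(adj_i)][0]
        collocations[(noun, "adj_modifier", adj)] += 1

    print(collocations.most_common())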


Semi-automatic glossary creation from learning objects

Eline Westerhout and Paola Monachesi (Universiteit Utrecht)

One of the aims of the Language Technology for eLearning project (www.lt4el.eu) is to show that NLP techniques can be employed to enhance the learning process. One of the tasks within the project is the extraction of definitions for glossary creation, and NLP techniques are adapted to this end. This novel application of current techniques, already employed within question answering systems, presents some interesting challenges. The most relevant one is the corpus of learning objects, which includes a variety of text genres and of authors' writing styles that pose a real challenge to computational techniques for the automatic identification and extraction of definitions. Furthermore, some of our learning objects are relatively small in size, so our approach has to favor not only precision but also recall: we want to make sure that all possible definitions present in a text are proposed to the user for the creation of the relevant glossary.
The pattern-based glossary candidate detector that has been developed is capable of extracting different types of definitions in eight languages. Using this approach we obtained an acceptable recall on the most frequent types of definitions (86%), but a low precision (28%). In order to improve the precision obtained with the pattern-based approach, machine learning techniques are applied to the Dutch results to filter out incorrectly extracted definitions, thus improving precision considerably (to more than 50%).
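
The abstract does not spell out the patterns or the classifier, so the following is only a minimal, invented sketch of the two-stage idea: a deliberately permissive lexico-syntactic pattern proposes definition candidates (favoring recall), after which a simple filter, standing in for the machine-learned classifier, removes unlikely candidates to recover precision. Pattern, features and thresholds are assumptions, not the LT4eL implementation.

    import re

    # Stage 1: a permissive "X is a/an/the Y" pattern proposes candidates (high recall).
    DEF_PATTERN = re.compile(r"^(?P<term>[A-Z][\w-]*) (is|are) (a|an|the) (?P<genus>.+)\.$")

    def candidate_definitions(sentences):
        return [s for s in sentences if DEF_PATTERN.match(s)]

    # Stage 2: a toy filter standing in for the machine-learned classifier that
    # removes false positives proposed by the pattern.
    def looks_like_definition(sentence):
        genus = DEF_PATTERN.match(sentence).group("genus")
        return len(genus.split()) >= 3 and genus.split()[0] not in {"one", "it", "that"}

    sentences = [
        "XML is a markup language for structured documents.",
        "This is a problem.",
    ]
    candidates = candidate_definitions(sentences)
    glossary = [s for s in candidates if looks_like_definition(s)]
    print(glossary)   # only the XML sentence survives the filter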


Detecting semantic overlap: Announcing a parallel monolingual treebank for Dutch

Erwin Marsi and Emiel Krahmer (Tilburg University)

The same semantic content can be expressed in many different ways, and this is a major stumbling block for NLP applications such as summarization, question-answering and information extraction. While resources for semantic overlap on the word level exist (e.g., WordNet), similar resources are lacking for the phrase and sentence level. In this talk, we describe an ongoing effort to build a 1 million word treebank for Dutch, consisting of both parallel texts (with essentially the same semantic content, e.g., different Dutch translations of the same book, autocue-subtitle pairs) and comparable texts (of varying degrees of semantic overlap, e.g., different news reports describing the same event, different answers to questions). We use a three-stage alignment procedure: first matching documents, then sentences in matched documents, and finally phrases and words in the matched sentences. All sentences from the corpus are dependency parsed (using Alpino), and nodes in the dependency trees are aligned and labeled using a limited set of semantic similarity relations. Two new tools for alignment (one for aligning sentences and one for aligning nodes within sentences) were developed for these purposes, and will be briefly described. In the talk, details are given about the corpus collection, the annotation procedure and the results. Finally, we also highlight the use of the corpus and derived tools for the aforementioned applications.
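
As an illustration of what such labeled node alignments could look like as data, the sketch below defines a small structure for links between dependency-tree nodes of two sentences. The relation labels and the example pair are invented for the illustration and need not match the annotation scheme used in the treebank.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Node:
        sentence_id: str    # which sentence in the corpus
        node_id: int        # node index in its (Alpino) dependency tree
        word: str

    @dataclass(frozen=True)
    class Alignment:
        source: Node
        target: Node
        relation: str       # e.g. "equals", "restates", "generalizes" (illustrative labels)

    # Two nodes from a pair of paraphrasing sentences, plus one labeled alignment.
    hond = Node("s1", 3, "hond")
    dier = Node("s2", 2, "dier")
    alignments = [Alignment(hond, dier, "generalizes")]   # "dier" is more general than "hond"

    for a in alignments:
        print(f"{a.source.word} --{a.relation}--> {a.target.word}")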


Creating a Dead Poets Society: extracting a social network of historical persons from the web

Gijs Geleijnse (Philips Research)

We present a simple method to extract information from search engine snippets. Although the techniques presented are domain independent, this work focuses on extracting biographical information of historical persons from multiple unstructured sources on the Web. We first find a list of persons and their periods of life by querying the periods and scanning the retrieved snippets for person names. Subsequently, we find biographical information for the persons extracted. In order to get insight into the mutual relations among the persons identified, we create a social network using co-occurrences on the Web. Although we use uncontrolled and unstructured Web sources, the information extracted is reliable. Moreover, we show that Web Information Extraction can be used to create both informative and enjoyable applications.
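
The abstract does not give the exact scoring, so the sketch below only illustrates the general co-occurrence idea on a few invented stand-ins for search-engine snippets: two persons are linked when both names appear in the same snippet, and the link weight is the number of such snippets.

    from collections import Counter
    from itertools import combinations

    persons = ["Vondel", "Rembrandt", "Spinoza"]

    # Invented stand-ins for retrieved search-engine snippets.
    snippets = [
        "Vondel wrote a poem on a painting by Rembrandt.",
        "Rembrandt and Spinoza both lived in Amsterdam.",
        "Vondel is considered the greatest Dutch poet.",
    ]

    edges = Counter()
    for snippet in snippets:
        mentioned = [p for p in persons if p in snippet]
        for a, b in combinations(sorted(mentioned), 2):
            edges[(a, b)] += 1   # edge weight = number of shared snippets

    print(edges)   # the weighted social network as an edge list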


Merging LT4eL with the Cornetto Database

Jantine Trapman, Paola Monachesi (Universiteit Utrecht)

The Language Technology for eLearning (LT4eL - www.lt4el.eu) project aims at improving accessibility and cross-lingual retrieval of dynamic content within the Learning Management System ILIAS. For the cross-lingual document retrieval task a domain ontology in the area of computing has been built. The ontology is linked to the upper ontology DOLCE via WordNet. The ontology forms the backbone of the retrieval task and to this end, it is mapped to lexica of nine languages.
In order to extend the retrieval of Dutch content to new domains, we have merged the LT4eL Dutch lexicon with the Cornetto database. Since Cornetto has also been mapped to WordNet, it is possible to connect the two systems. This mapping also allows for a relatively easy way to expand the existing ontology. In particular, for the exploration and analysis of existing ontologies we have used Watson, a semantic web search engine. We show how and to what extent Watson's functionalities support the task of seeding existing ontologies.
The merge of the Cornetto database and the LT4eL lexicon improves cross-lingual document retrieval: not only is it now possible to search within a larger domain, but Cornetto also provides additional lexical information which will be exploited in several ways.
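
As a rough sketch of how two resources can be connected through a shared mapping to WordNet (all identifiers below are invented for illustration, not actual LT4eL or Cornetto data), the fragment joins two tables on their WordNet synset IDs.

    # Invented identifiers, for illustration only.
    lt4el_to_wordnet = {                  # LT4eL ontology concept -> WordNet synset
        "lt4el:Computer": "wn:03082979-n",
        "lt4el:Keyboard": "wn:03614007-n",
    }
    cornetto_to_wordnet = {               # Cornetto lexical unit -> WordNet synset
        "cdb:computer-1": "wn:03082979-n",
        "cdb:toetsenbord-1": "wn:03614007-n",
    }

    # Join on the shared WordNet synset to link Cornetto entries to ontology concepts.
    wordnet_to_concept = {syn: concept for concept, syn in lt4el_to_wordnet.items()}
    cornetto_to_concept = {
        unit: wordnet_to_concept[syn]
        for unit, syn in cornetto_to_wordnet.items()
        if syn in wordnet_to_concept
    }
    print(cornetto_to_concept)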


Evaluating a multi-lingual keyphrase extractor in an eLearning context

Lothar Lemnitzer and Paola Monachesi (Tuebingen University and Utrecht University)

One of the functionalities developed within the Language Technology for eLearning project (www.LT4el.eu) is the possibility to annotate learning objects semi-automatically with keyphrases that describe them. To this end, a tool has been developed which extracts relevant keyphrases from documents in eight languages. The approach employed is based on a linguistic processing step, followed by the filtering of candidate keywords and their subsequent ranking based on frequency criteria.
Even though the techniques employed in the development of the tool have been widely explored in the NLP and IR communities, the application to the eLearning domain presents new challenges, especially with respect to evaluation. Since our application is different from the construction of a domain ontology or a terminological lexicon, we cannot simply adopt the evaluation criteria used for term extraction in these areas.
We discuss three tests, developed to evaluate the performance of the tool, and their results: the first simply measures recall and precision against an established gold standard; the second measures inter-annotator agreement, in order to determine the complexity of the task and to evaluate the performance of the tool with respect to human annotators; the third measures the appropriateness of the keyword extractor in the context of the semi-automatic metadata annotation of learning objects.
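
The abstract does not give the ranking formula, so the fragment below is only an invented, minimal version of the filter-then-rank idea: candidate words are filtered with a stop list (a crude stand-in for the linguistic processing and filtering steps) and ranked by how much more frequent they are in the learning object than in a background corpus.

    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "in"}

    def keyphrase_candidates(text):
        tokens = [t.lower().strip(".,") for t in text.split()]
        return [t for t in tokens if t and t not in STOPWORDS]

    def rank_keyphrases(document, background_freq, top_n=3):
        doc_freq = Counter(keyphrase_candidates(document))
        # Score by relative frequency against the background corpus (add-one smoothing).
        scores = {w: f / (background_freq.get(w, 0) + 1) for w, f in doc_freq.items()}
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    # Invented background frequencies and a toy learning object.
    background = Counter({"computer": 50, "device": 40, "input": 30})
    doc = "The keyboard is an input device. A keyboard sends keystrokes to the computer."
    print(rank_keyphrases(doc, background))   # 'keyboard' ranks first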


Exploiting background knowledge during automatic controlled keyword suggestion

Luit Gazendam(1), Veronique Malaise(2), Hennie Brugman(3) ((1)Telematica Instituut, (2)Vrije Universiteit, (3)Max Planck Instituut voor Psycholinguistiek)

The use of keywords for indexing is common practice. In many libraries these keywords are assigned manually by cataloguers who select them from a controlled vocabulary: a thesaurus or an ontology. These thesauri and ontologies are structured according to, e.g., ISO norm 2788-1986, which allows semantics and background knowledge to be stored in a standardised way: broader term, narrower term, related term, use and use-for relations explicitly model the semantic relations between the terms.
In this talk we present a method which exploits this background knowledge during the extraction of keywords. First the method uses the relations to create semantic clusters from the extracted keywords. Then we weight the clusters and select the most central terms as keyword suggestions. We recognise the importance of two factors in this method: 1) the richness of the vocabulary used and 2) the clustering and weighting algorithm. We will show the results of experiments in a library environment in which we compared automatically suggested keywords to manually assigned keywords. During the experiments we varied the two factors: we implemented both a simple and a more elaborate clustering and weighting algorithm, and we used both a standard thesaurus and an enriched thesaurus.
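
The clustering and weighting algorithms themselves are not given in the abstract; the sketch below only illustrates the general idea on an invented thesaurus fragment: extracted terms are clustered through their thesaurus links, and the most connected ("most central") term of each non-trivial cluster is proposed as a keyword suggestion.

    from collections import deque

    # Invented thesaurus fragment: term -> thesaurus-related terms
    # (broader/narrower/related collapsed into one symmetric relation).
    RELATED = {
        "violin": {"string instrument", "orchestra"},
        "cello": {"string instrument", "orchestra"},
        "orchestra": {"violin", "cello", "concert"},
        "weather": set(),
    }

    def linked(a, b):
        return b in RELATED.get(a, set()) or a in RELATED.get(b, set())

    def clusters(terms):
        """Connected components of the extracted terms under the thesaurus relation."""
        remaining, groups = set(terms), []
        while remaining:
            seed = remaining.pop()
            group, queue = {seed}, deque([seed])
            while queue:
                current = queue.popleft()
                for other in list(remaining):
                    if linked(current, other):
                        remaining.remove(other)
                        group.add(other)
                        queue.append(other)
            groups.append(group)
        return groups

    def central_term(group):
        """Most connected term within its cluster: a stand-in for the weighting step."""
        return max(group, key=lambda t: sum(linked(t, u) for u in group if u != t))

    extracted = ["violin", "cello", "orchestra", "weather"]
    suggestions = [central_term(g) for g in clusters(extracted) if len(g) > 1]
    print(suggestions)   # the most central term of the music cluster ('orchestra')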


Informative Features for Relation Extraction

Sophia Katrenko (University of Amsterdam)

In this talk we discuss relation learning and, in particular, focus on the extraction of informative features from syntactic structures. Many previous approaches used either predefined patterns (Rinaldi, 2004) or kernel methods in which all possible subsequences of a given structure are considered (Bunescu, 2005). Instead, we aim at explicitly extracted features and compare two methods to accomplish this task. One of them is based on well-known statistical measures for n-gram extraction such as log-likelihood, Dice and others (Dunning, 1993); the other uses grammatical inference (Solan, 2005) to extract the most frequent subsequences. The difference between the two approaches lies in the way subsequences are found. While the former uses a local context of a predefined length, the latter attempts to take a wider context into account. When tested on several corpora from the biomedical domain, both approaches provide performance comparable to state-of-the-art results.
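
As a worked illustration of two of the statistical measures mentioned (with invented counts, not the corpora from the talk), the sketch below computes the Dice coefficient and a Dunning-style log-likelihood ratio for a candidate bigram from its contingency counts.

    import math

    def dice(c_xy, c_x, c_y):
        """Dice coefficient for the bigram (x, y)."""
        return 2 * c_xy / (c_x + c_y)

    def log_likelihood(c_xy, c_x, c_y, n):
        """Dunning-style log-likelihood ratio (G^2) from the 2x2 contingency table."""
        k = [[c_xy, c_x - c_xy],
             [c_y - c_xy, n - c_x - c_y + c_xy]]
        total = sum(map(sum, k))
        g2 = 0.0
        for i in range(2):
            for j in range(2):
                expected = sum(k[i]) * sum(row[j] for row in k) / total
                if k[i][j] > 0:
                    g2 += k[i][j] * math.log(k[i][j] / expected)
        return 2 * g2

    # Invented counts: the bigram occurs 30 times, its parts 50 and 60 times, in 10,000 tokens.
    print(dice(30, 50, 60))
    print(log_likelihood(30, 50, 60, 10_000))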


The Automatic Detection of Scientific Terms in Patient Information

Veronique Hoste, Klaar Vanopstal and Els Lefever (LT3, University College Ghent)

Despite the legislative efforts to improve the readability of patient information, different surveys have shown that respondents still feel distressed by reading the information, or even consider it fully incomprehensible. Our research deals with one of the sources of distress: the use of scientific terminology in patient information. In order to assess the scale of the problem, we collected a Dutch-English parallel corpus of European Public Assessment Reports (EPARs), which was annotated by two annotators. This corpus was used for evaluating and training an automatic approach to scientific term detection.
We investigated the use of a lexicon-based approach and a learning-based approach which relies only on text-internal clues. Finally, both approaches were combined in an optimized hybrid learning-based term extraction experiment. We show that whereas the lexicon-based approach yields high precision scores on the detection of scientific terms, its coverage remains limited.
The learning-based approach, which relies on a limited number of straightforward features (word form, lemma and part-of-speech of the two following and preceding words, presence of numeric symbols and multiple capital letters, initial and final trigram), outperforms the lexicon-based approach. The optimized hybrid learner obtains an F-score of 80% and remains quite robust despite the highly skewed data set.
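
As a rough, simplified illustration of the kind of feature vector such a learner might see (lemma and part-of-speech features are omitted here because they would require a tagger, and the sentence is invented), the sketch below builds per-token features from the word forms in a window of two, plus the orthographic clues mentioned in the abstract.

    def token_features(tokens, i):
        """Simplified feature dict for token i; lemma/POS features are omitted in this sketch."""
        word = tokens[i]
        window = {f"word_{offset:+d}": tokens[i + offset]
                  for offset in (-2, -1, 1, 2)
                  if 0 <= i + offset < len(tokens)}
        return {
            "word": word,
            **window,
            "has_digit": any(ch.isdigit() for ch in word),
            "multi_caps": sum(ch.isupper() for ch in word) > 1,
            "prefix_3": word[:3],
            "suffix_3": word[-3:],
        }

    sentence = "Patients received 40 mg of pantoprazole daily".split()
    print(token_features(sentence, 5))   # features for "pantoprazole"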


Deriving Ontologies from Wikipedia for Semantic Annotation of Texts

Willem-Olaf Huijsen, Christian Wartena and Rogier Brussee (Telematica Instituut)

The detection of concepts mentioned in texts has many applications, such as the linking of the concepts to other relevant information items. Two approaches exist: the lexicon-based approach and the approach using lexicalized ontologies. In previous work, we developed the Apolda tool that implements an approach based on lexicalized ontologies. Apolda is a freely available plugin for GATE that can annotate texts on the basis of an OWL ontology that is loaded as a resource in GATE.
In this talk, we describe the methods we used to derive ontological data in OWL syntax from the Dutch Wikipedia, and how these ontologies were used to enhance the annotation by Apolda. The ontologies include the subconcept and "synonymy" relations. The subconcept relation is derived directly from the category structure in Wikipedia. The synonymy relation (more correctly, alternative textual representations) is derived from redirects and alternative link labels. The approach taken is language independent and can be used to derive ontological data taking any Wikipedia category as the top concept.
An ontology of the arts domain extracted from Wikipedia using our methods turned out to be very helpful for finding relevant keywords from short descriptions of video fragments. In a second experiment, the synonymy information derived from Wikipedia also improved the quality of keywords automatically extracted from texts.
We conclude that Wikipedia is a good resource for deriving ontological data that is useful for text analysis applications such as the lexical annotation of texts in Apolda.
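
As a rough, invented sketch of the derivation step described above (not the actual extraction pipeline, and the property names are only illustrative stand-ins for the OWL vocabulary used), the fragment below turns a toy Wikipedia category table and redirect table into subconcept and alternative-label triples.

    # Toy stand-ins for Wikipedia dump tables, invented for illustration.
    category_parents = {            # category -> its parent categories
        "Painters": ["Artists"],
        "Dutch painters": ["Painters"],
    }
    redirects = {                   # redirect title -> target article
        "Rembrandt van Rijn": "Rembrandt",
    }

    def to_triples(category_parents, redirects):
        triples = []
        for child, parents in category_parents.items():
            for parent in parents:
                triples.append((child, "rdfs:subClassOf", parent))   # subconcept relation
        for alias, target in redirects.items():
            triples.append((target, "skos:altLabel", alias))         # alternative textual representation
        return triples

    for triple in to_triples(category_parents, redirects):
        print(*triple)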