CLIN 2007

Friday December 7, 2007
University of Nijmegen

CLIN 2007 is organized by the Language and Speech group of the Radboud University Nijmegen.

Computational Linguistics in the Netherlands (CLIN) 2007

Abstracts poster session

A natural language interface to a cinema database
Bayesian Model Merging for Unsupervised Constituent Labeling
Finding features for detecting discourse relations between sentences
The MiniSTEx ontology
Reference vs. Assertion - New answers to an Old Problem
An ensemble of Spanish dependency parsers
The detection of noun phrases in Dutch newspapers

A natural language interface to a cinema database
Ana Guimarães, Luísa Coheur, Nuno J. Mamede (L2F INESC-ID)

With the spread of keyword search engines, Natural Language Interfaces to Databases (NLIDB) are no longer as popular as they were in the 1980s and are now considered a branch of QA systems. Nevertheless, NLIDB have their own advantages because they combine natural language input with structured information sources, allowing many answers to be retrieved directly. They can therefore lead to powerful QA sub-systems in specific domains.
With a view to integration into a general QA system, we present a Portuguese interface to a multi-language cinema database. The system is based on a deep semantic analysis of the question, performed by a robust natural language processing chain. Because the database has more than 2,500,000 movie titles, actor names and staff names, named entity recognition (NER) and disambiguation became challenging tasks.
Interleaved with the deep semantic analysis, the database itself is used for NER by means of full-text queries. This paper reports the pros and cons of this approach. Moreover, several strategies are used to deal with ambiguity: context information is used in the simpler cases, and in the remaining cases the user chooses among the available options. Some preliminary studies on the usability of the interface are also presented.
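As a rough illustration of the lookup-and-disambiguate idea described above, the following minimal sketch uses an in-memory SQLite table. The schema, the demo rows, the function names and the `expected_kind` context cue are all assumptions made for illustration, and a plain `LIKE` query stands in for a real full-text query; the actual system's database and queries are not described in the abstract.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory stand-in for the cinema database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entity (name TEXT, kind TEXT)")
    conn.executemany(
        "INSERT INTO entity VALUES (?, ?)",
        [
            ("Casablanca", "movie"),
            ("Chaplin", "movie"),
            ("Charlie Chaplin", "person"),
            ("Woody Allen", "person"),
        ],
    )
    return conn


def candidates(conn, phrase):
    """Return every (name, kind) row whose name contains the phrase.

    LIKE is used here only as a simple substitute for a full-text query.
    """
    rows = conn.execute(
        "SELECT name, kind FROM entity WHERE name LIKE ? ORDER BY name",
        (f"%{phrase}%",),
    )
    return rows.fetchall()


def resolve(conn, phrase, expected_kind=None):
    """Resolve a phrase to one entity, or report the options still open.

    expected_kind is an assumed context cue (for example, the question asks
    about an actor, so only persons are plausible). Returns (match, options):
    match is the single surviving candidate or None; options lists the
    candidates the user would have to choose from when match is None.
    """
    found = candidates(conn, phrase)
    if expected_kind is not None:
        found = [c for c in found if c[1] == expected_kind]
    if len(found) == 1:
        return found[0], []
    return None, found
```

With this demo data, `resolve(conn, "Chaplin")` leaves two options open (a movie and a person), whereas `resolve(conn, "Chaplin", expected_kind="person")` settles on the person without asking the user.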

Bayesian Model Merging for Unsupervised Constituent Labeling
Borensztajn, G. and Zuidema, W. (ILLC, UvA, Amsterdam)

Recent research on unsupervised grammar induction has focused on inducing accurate bracketings of sentences. Here we present an efficient Bayesian algorithm for the unsupervised induction of syntactic categories from such bracketed text.
Using gold-standard bracketing, our model gives state-of-the-art results on this task: it obtains an F1 of 76.8% (when appropriately relabeled), outperforming the recent semi-supervised approach of Haghighi and Klein (2006). Our algorithm assigns a likelihood to unseen text that is comparable to that of the treebank PCFG. Finally, we discuss the metrics used and the linguistic relevance of the results.

The MiniSTEx ontology
Ineke Schuurman, Vincent Vandeghinste (Centrum voor Computerlinguïstiek, K.U.Leuven)

The MiniSTEx system is meant to locate the spatiotemporal characteristics of eventualities on a time axis and/or a map. The information we use to do so ranges from knowledge of constituency, through knowledge of the types of verbs and objects involved, to several kinds of spatiotemporal knowledge stored in a database.
In this presentation we will explain the rationale behind our ontology, with ample attention to the similarities and differences between its various parts (temporal, geospatial, spatial). We will also pay attention to the consequences of missing information: what if we don't know whether we are dealing with an event, state or process, or if we don't know the coordinates (longitude, latitude) of a town? Or not even which town is meant? How do we deal with this? Which types of information are optional, and why?

Reference vs. Assertion - New answers to an Old Problem
Streit, Michael (DFKI)

Discourse representation theories do not clearly distinguish between information that is used to describe objects and information that is asserted about objects. But the distinction is relevant: if a dialog participant, let's call him A, describes an object, participant B may react with a comment on the existence, non-existence, or uniqueness of the object; if A makes an assertion about an object, B may react with a comment on the truth of the assertion (in fact things may be more complicated, because participants may not share the same view of what is going on). Viewing this issue as a subclass of presupposition analysis does not do justice to the problem. The determination of an object (or a set of objects) is not only a matter of definite NPs and the like. Instead, a sequence of several sentences may be devoted to the successive specification of an object in a more or less explicit way. Ignoring this causes certain mistakes: presuppositions, as they are usually ascribed to definite NPs, do not express what is intended, because the description of the object is not finished at the moment the definite NP is introduced. Information that is meant as a contribution to the determination of an object is taken as an assertion, and assertions may, implicitly, be taken as part of descriptions of objects (as a consequence of the evaluation algorithm of DRT). Both lead to misinterpretations and inadequate dialog behavior.
The problem is hardly addressed directly in the literature. I will therefore set out the problem in some detail and also address its relation to the influential notion of rigidity and to the discussion about the E-type approach vs. the “variable” approach of DRT, before sketching my approach to an extended discourse representation that distinguishes between reference identification and assertions.

An ensemble of Spanish dependency parsers
Roser Morante and Antal van den Bosch (Tilburg University)

We present an ensemble system for dependency parsing of Spanish text, combining three machine-learning-based dependency parsers: Nivre's MaltParser, Canisius' memory-based constraint-satisfaction inference parser, and a new memory-based parser based on a single word-pair relation classifier. The dependency graphs generated by this ensemble will be used as input for a semantic role labeler that relies on them being correct. Since parsing ensemble methods have generally been shown to improve over their participating elements, it seems appropriate to develop an ensemble system that includes the current best Spanish dependency parser, MaltParser.
The presented ensemble system operates in two stages. In the first stage, each of the three parsers analyzes an input sentence and produces a dependency graph. Unlabeled attachment scores in this stage range from 82% to 86%. In the second stage, a weighted voting system is applied that distills a final dependency graph out of the three first-stage dependency graphs. Weighting is based on per-relation, per-parser accuracy scores obtained on held-out material.
The experiments are carried out on the Cast3LB-CoNLL-SemRol Corpus of Spanish, a revised version of the Cast3LB treebank, containing 89,199 words with automatically generated morpho-syntactic information and non-projective dependency graphs over its 3,303 sentences.
We will describe the experiments and analyze the results, both quantitatively (in terms of the CoNLL-2006 evaluation metrics) and in terms of the qualitative differences in performance between the three parsers and the ensemble system.
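The second-stage weighted vote described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the parser names, the weight values, the fallback weight and the per-token voting scheme are all hypothetical.

```python
from collections import defaultdict

# Hypothetical per-relation, per-parser accuracies, as they might be measured
# on held-out material. Names and numbers are illustrative only.
WEIGHTS = {
    ("parser_a", "SUBJ"): 0.90, ("parser_a", "OBJ"): 0.80,
    ("parser_b", "SUBJ"): 0.85, ("parser_b", "OBJ"): 0.88,
    ("parser_c", "SUBJ"): 0.70, ("parser_c", "OBJ"): 0.75,
}
DEFAULT_WEIGHT = 0.5  # assumed fallback for unseen (parser, relation) pairs


def vote(predictions):
    """Combine per-token (head, relation) predictions by weighted voting.

    predictions maps a parser name to a list of (head_index, relation) pairs,
    one pair per token. Each parser's vote for a token is weighted by that
    parser's accuracy on the relation it proposes. Returns one
    (head_index, relation) pair per token.
    """
    n_tokens = len(next(iter(predictions.values())))
    combined = []
    for i in range(n_tokens):
        scores = defaultdict(float)
        for parser, arcs in predictions.items():
            head, rel = arcs[i]
            scores[(head, rel)] += WEIGHTS.get((parser, rel), DEFAULT_WEIGHT)
        # Pick the (head, relation) pair with the highest summed weight.
        combined.append(max(scores, key=scores.get))
    return combined
```

Note that voting token by token, as in this sketch, does not by itself guarantee that the combined arcs form a well-formed dependency graph; how the actual system handles this is not stated in the abstract.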

The detection of noun phrases in Dutch newspapers
Sandra Kanters (Radboud Universiteit Nijmegen)

For several applications, Polderland Language & Speech Technology needs a high-performance noun phrase (NP) detector for Dutch. This detector is, among other things, intended as an extension to KLiP, an existing Polderland tool. KLiP performs linguistic analysis on natural language input text and is used in information retrieval. As NPs contain a lot of information about the contents of documents, the detection of these phrases is a useful addition to KLiP.
In my study, the Algemene Nederlandse Spraakkunst and corpus material (from the newspaper De Groene Amsterdammer) were first consulted, and rules for NP structure were formulated manually. The rules were then implemented in RecDescent, a Perl module with which top-down recursive-descent parsers can be built. After the NP detector was implemented, tests were carried out both on sentences extracted from the same newspaper as the training corpus and on sentences from several other newspapers. The precision and recall percentages varied from about 55% to about 96%. Better results were obtained for less complex noun phrases. Furthermore, mismatches (incorrect accepts and incorrect rejects) for the less complex NPs were mostly due to restrictions of the rules. For the more complex NPs, ambiguity of the language played a larger role.
In the presentation I will give a summary of the research carried out. I will explain the definition of the noun phrase used in my study, and I will go into more detail about the rules and the test results.
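To make the recursive-descent approach and the precision/recall evaluation mentioned above more concrete, here is a minimal sketch in Python rather than Perl. The toy grammar, the tag names and the exact-span matching criterion are assumptions made for illustration; they are not the rules or the evaluation procedure used in the study.

```python
# Toy grammar (an assumption, not the rules from the study):
#   NP -> DET? ADJ* NOUN PP?
#   PP -> PREP NP


def parse_np(tags, i):
    """Try to parse an NP starting at position i; return its end index or None."""
    j = i
    if j < len(tags) and tags[j] == "DET":
        j += 1
    while j < len(tags) and tags[j] == "ADJ":
        j += 1
    if j >= len(tags) or tags[j] != "NOUN":
        return None
    j += 1
    end = parse_pp(tags, j)  # optional PP modifier
    return end if end is not None else j


def parse_pp(tags, i):
    """Try to parse a PP starting at position i; return its end index or None."""
    if i < len(tags) and tags[i] == "PREP":
        return parse_np(tags, i + 1)
    return None


def detect_nps(tags):
    """Return (start, end) spans of the NPs found while scanning left to right."""
    spans, i = [], 0
    while i < len(tags):
        end = parse_np(tags, i)
        if end is None:
            i += 1
        else:
            spans.append((i, end))
            i = end
    return spans


def precision_recall(predicted, gold):
    """Compute precision and recall over exactly matching spans."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

For the tag sequence `["DET", "ADJ", "NOUN", "PREP", "DET", "NOUN", "VERB"]` (as for "de oude man in het huis slaapt"), `detect_nps` returns the single span `(0, 6)`, because the prepositional phrase is attached to the first noun.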
