|Computational Linguistics in the Netherlands (CLIN) 2007|
Abstracts parallel session 1
Finding features for detecting discourse relations between sentences
Although discourse analysis is considered useful for many applications in the field of language technology, automatic discourse parsing is still problematic. A widely accepted model for discourse analysis is Rhetorical Structure Theory, developed by Mann and Thompson (1988). Soricut and Marcu (2003) have developed the discourse parser SPADE, which detects RST-relations between Elementary Discourse Units (EDUs) within a sentence. An automatic discourse parser that is able to find rhetorical relations at higher levels in the text is not yet available.
Our research focuses on the rhetorical relations between (Multi-)Sentential Discourse Units ((M-)SDUs) – text spans consisting of one or more sentences – within the same paragraph. The goal of our research is to establish what information is useful in detecting these relations. We therefore simplified the task of discourse parsing to a decision problem in which we decide whether an (M-)SDU is rhetorically related to either a preceding or a following (M-)SDU. Employing the RST Corpus (Carlson et al. 2003), we offer this choice to machine learning algorithms together with syntactic, lexical, referential and surface features in order to determine which features are most useful.
The presentation will illustrate our method and the information features used, and will present our conclusions on the benefit of these features for automatic discourse parsing of paragraphs.
In this study we try to create computer models that can predict where and when a Middle Dutch text was written, using its orthography as the discriminating factor. We derive these data-driven models from corpora containing charters from the 13th and 14th century. In contrast with earlier research, we will not concentrate on phonetic differences but solely on orthography.
For both location and date, we have to determine an appropriate granularity at which we will attempt classification. For locations, we focus on specific cities or villages for which ten or more charters are available. For dates, we examine various time windows, also containing sufficient charters.
We attempt classification using standard machine learning techniques and related author verification algorithms. The machine learning features we use are character n-grams and preferences for a number of common spelling alternations.
We will discuss the quality of the classification, and what conclusions we can draw from this about diatopic and diachronic spelling variation for 13th and 14th century Middle Dutch charters.
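As a minimal sketch (function name and example word are illustrative, not from the abstract), character n-gram features of the kind used here can be extracted as follows:

```python
def char_ngrams(text, n=3):
    """Return the list of overlapping character n-grams in a text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# A charter's spelling habits surface in its n-gram profile, e.g. the
# 'gh' digraph characteristic of some Middle Dutch orthographies:
ngrams = char_ngrams("ghegheven")
```

The resulting n-gram counts can then feed any standard classifier, alongside counts of preferred spelling alternations.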
Finding information about people on the World Wide Web is one of the most popular activities of Internet users. Given the high ambiguity of person names and the increasing amount of information on the web, it becomes very important to organize this large amount of information into meaningful clusters, each referring to a single individual.
For the SemEval-2007 contest, we approached the task from both a supervised and an unsupervised perspective.
First, the data sets (the first 100 results for a person-name query) were preprocessed by means of a memory-based shallow parser (MBSP).
For the supervised classification, the task was redefined in the form of feature vectors containing disambiguating information on pairs of documents. To this end, a rich feature space combining biographic facts (place and date of birth and death), named-entity features, URL and e-mail addresses, and meta-information about the web page (URL of the document and IP address indicating geographical location) was constructed from the content, titles and snippets of the documents.
In addition, different clustering approaches were applied to matrices of weighted keywords. To determine the relevance of the extracted keywords, we applied Term Frequency-Inverse Document Frequency (TF-IDF) weighting and used WordNet as a means of finding synonymy relations between keywords. Finally, the resulting cluster sets were merged by taking the classification output as basic "seed" clusters, which were then enhanced with the results from the keyword clustering experiments.
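A minimal sketch of TF-IDF keyword weighting of the kind described above (variable names are illustrative, not taken from the system):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for each term in each document.

    docs: list of token lists. Returns one {term: weight} dict per doc.
    """
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights
```

Terms that occur in every document receive weight zero, so only discriminative keywords contribute to the clustering matrices.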
We will discuss several evaluation issues.
We present the main outcomes of the Stevin COREA project: a corpus annotated with coreferential relations and the evaluation of the coreference resolution module developed in the project.
In the project we developed annotation guidelines for coreference resolution for Dutch and annotated a corpus of 100k tokens. We discuss these guidelines, the annotation tool, and the inter-annotator agreement. We also show a visualization of the annotated relations.
The standard approach to evaluating a coreference resolution system is to compare the predictions of the system to a hand-annotated gold-standard test set (cross-validation). A more practically oriented evaluation is to test the usefulness of coreference relation information in an NLP application. We ran experiments with an Information Extraction module for the medical domain and measured the performance of this module with and without the coreference relation information. We present the results of both this application-oriented evaluation of our system and of a standard cross-validation evaluation. For the latter, we also discuss the influence of different information sources (syntactic and semantic) on coreference resolution accuracy. We end with a discussion of the limitations of our current approach.
Automatic segmentation and classification of acts in Dutch task-oriented information-seeking dialogue
Jeroen Geertzen (Tilburg University)
The study presented in the paper explores the task of automatic segmentation and classification of dialogue acts in Dutch task-oriented information-seeking spoken dialogue with both human-human and human-machine interactions. Different types of machine learning classifiers are applied using various acoustic-prosodic and non-acoustic-prosodic features. The dialogue act taxonomy used (DIT) is a multi-dimensional scheme that groups fine-grained dialogue acts related to the same aspect of communication in the same dimension. For each dimension, the classification results of two approaches are evaluated: a two-step approach in which the dialogue is first automatically segmented and the segments are subsequently identified as specific dialogue acts, and an approach in which segmentation and classification are attempted simultaneously. To allow the latter approach, segments with corresponding dialogue act labels are encoded at the token level, similar to common chunking tasks.
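The token-level encoding for the simultaneous approach can be sketched with a BIO-style labelling, as in standard chunking tasks (the function, act labels and segment format here are a hypothetical illustration, not the paper's actual scheme):

```python
def encode_bio(tokens, segments):
    """Encode dialogue-act segments as token-level B-/I- labels.

    segments: list of (start, end, act) triples over token indices,
    with `end` exclusive. Unsegmented tokens get the label "O".
    """
    labels = ["O"] * len(tokens)
    for start, end, act in segments:
        labels[start] = "B-" + act          # segment-initial token
        for i in range(start + 1, end):
            labels[i] = "I-" + act          # segment-internal tokens
    return labels
```

A sequence classifier trained on such labels performs segmentation and dialogue-act classification in one pass.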
In the past years, the number of service requests through e-mail has shown an explosive growth.
To cope with the increased numbers of received e-mails, current call centres are "transformed" into so-called contact centres, handling both e-mail and telephone calls.
Personal answering (each e-mail answered by a human agent) is much too expensive.
To speed up the process and to improve the consistency of the answers, these contact centres often use a predefined set of answers for similar e-mails, covering most of the e-mail service requests (frequently asked questions).
By using Information Retrieval (IR) techniques, text categorization techniques and language technology, the process of finding the correct answer for a request can be partly automated.
By automatically suggesting a relevant subset of predefined answers (at most 5 answer suggestions) for each incoming e-mail, the contact centre agent only has to pick the correct answer and send the reply.
A practical study using a corpus of 17,000 incoming e-mails (collected and categorized in a Dutch contact centre) has shown that this approach can present the correct answer within a ranked list of 5 suggestions for almost 88% of all incoming e-mails.
This technology is not limited to written content: once ASR accuracy is sufficiently high, it can be used for spoken content as well. We will show that these IR and categorization techniques, combined with LVCSR (Large Vocabulary Continuous Speech Recognition), have been successfully applied to handling the first question in most call centres: why do you call us?
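The answer-suggestion step can be sketched as a simple similarity ranking against previously categorized e-mails (a hypothetical illustration using cosine similarity over bag-of-words vectors; the function names and data are not from the study):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency dicts."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def suggest_answers(incoming, answered, k=5):
    """Rank predefined answers by the similarity of the incoming
    e-mail to already-categorized e-mails; return the top k."""
    qv = Counter(incoming.lower().split())
    scored = [(cosine(qv, Counter(mail.lower().split())), answer)
              for mail, answer in answered]
    return [answer for _, answer in sorted(scored, reverse=True)[:k]]
```

The agent then only inspects the short ranked list instead of the full set of predefined answers.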
We have applied state-of-the-art text classification methods to two tasks: categorizing children's speech in the CHILDES database by gender and by age. Both tasks are binary; for age, we distinguish two age groups between the ages of 1.9 and 3.0 years. We approach the study of terms and their frequencies in children's speech from a particular angle: we have applied various machine learning techniques from the field of document classification to one of the corpora, including k-Nearest Neighbours, Neural Networks, Support Vector Machines, boosting and classifier committees.
These methods were compared with traditional linguistic measures in terms of accuracy. Standard measures in the study of children's language acquisition are the average length of utterances, the types of words, their frequency, and the Type-Token Ratio. Although these are standard measures, the reliability of these indicators of (morpho-)syntactic complexity is widely debated.
Both the traditional measures and the text classification methods are assessed for their feasibility. The text classification methods are based on the bag-of-words approach: we assume that the position of a word in a document can be ignored. Each document is then represented by the vector of its word frequencies.
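The bag-of-words representation described above can be sketched as follows (a minimal illustration; the function and vocabulary are hypothetical):

```python
from collections import Counter

def bow_vector(tokens, vocabulary):
    """Represent a document by its word-frequency vector over a fixed
    vocabulary, ignoring word positions (bag-of-words assumption)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]
```

Each transcript of a child's speech becomes one such vector, which the classifiers listed above take as input.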
Our results show that the machine learning text classification methods outperform the traditional linguistic measures on the classification task. Moreover, we were able to perform gender classification with significant results. These findings may be of interest to linguists who study language acquisition.
In the area of document classification, so-called `topic models' have recently gained considerable attention. A topic model is a model that considers documents as mixtures of topics, a topic being a probability distribution over words. This research will look at the possibility of exploiting such topic models for the automatic discrimination of ambiguous nouns. The key idea is to jointly model dependency triples as well as `bag of words' data according to a topic model. The intuition is that a noun's dependency triples can be disambiguated by the topics found by the bag-of-words approach. The use of three-way data allows one to determine which topic(s) are responsible for a certain sense of a word and to adapt the corresponding feature vector accordingly, `subtracting' one sense to discover another.
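The core modelling assumption, documents as mixtures of topics, can be written out as a one-line computation (a generic sketch of the topic-model word probability, not the specific model of this research; the toy distributions are invented):

```python
def word_prob(doc_topic_mix, topic_word_dists, word):
    """P(word | document) under a topic model:
    the sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(p_topic * topic_word_dists[topic].get(word, 0.0)
               for topic, p_topic in doc_topic_mix.items())

# Toy example: an ambiguous word like 'bank' gets its probability
# mass from whichever topic is active in the document.
p = word_prob({"sports": 0.5, "finance": 0.5},
              {"sports": {"ball": 0.2}, "finance": {"bank": 0.3}},
              "bank")
```

Under this view, the topics active in a document indicate which sense of an ambiguous noun its dependency triples belong to.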
While informal political discourse has become an ever more important feature of the intellectual landscape of the Internet, the automatic extraction of meaningful information from such exchanges remains a formidable challenge. In this talk we describe our ongoing experiments in classifying texts by the political orientation of their writers by means of sentiment analysis for the informal political domain, and the challenges attendant on analysing this kind of data. We describe the ways in which the informal format makes automatic processing difficult, and the deeper ways in which participants' rhetorical goals obscure their opinions in the text.
We describe a machine learning approach to the prediction of the personality of an author on the basis of linguistic properties of the text he or she wrote. We collected a corpus of 145 essays written by BA-level students on a single topic and had each student take a personality test, providing us with a Myers-Briggs personality profile. The approach we took is an automatic text categorization approach. First a document representation is constructed based on feature selection from the linguistically analyzed texts. For the linguistic analysis we used the memory-based shallow parser. The document representations using these features are associated with each of the four components of the Myers-Briggs Type Indicator (Introverted-Extraverted, Sensing-Intuitive, Thinking-Feeling, Judging-Perceiving). This produces four binary classification tasks that are learned using memory-based learning. Results indicate that the first two personality dimensions can be predicted fairly accurately (~80%). We discuss the contribution of the different linguistic information sources (lexical, syntactic, token-based) to the overall accuracy.
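The memory-based learning used for these binary tasks can be sketched in its simplest form, nearest-neighbour classification over stored training instances (a 1-NN illustration with invented feature vectors; the actual system uses richer document representations):

```python
def nearest_neighbour(train, x):
    """Memory-based (1-NN) classification: return the label of the
    stored instance closest to x by squared Euclidean distance.

    train: list of (feature_vector, label) pairs.
    """
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(train, key=lambda item: sq_dist(item[0], x))[1]
```

One such classifier is trained per MBTI dimension, giving four independent binary predictions per essay.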