|Computational Linguistics in the Netherlands (CLIN) 2007|
Abstracts parallel session 3
Identifying Nominalisation Compounds
This paper investigates the automatic detection and identification of nominalisation compounds. The identification methods extend research carried out by Lapata (2002) and Nicholson (2005), who use statistical methods to interpret the argument-verb relation between the modifier and head in compound nominalisations. Our basic system achieves an accuracy of 71% on a subject-object classification task, and 58% on a three-fold disambiguation task distinguishing subject, object and prepositional complement relations. Using a database created from Nomlex, Celex and Catvar, our detection method achieves recall of 94.5% and 99.5% on nominalisation detection in sets of compounds retrieved from open text.
We present three methods that identify compounds containing argument-verb relations from these sets, with the goal of increasing precision over previous studies. Best performance is achieved when only compounds whose modifier is attested as an argument of the head noun's root verb in the BNC are identified as argument-nominalisation sequences. This method achieves accuracies of 70% and 76%, over baselines of 42% and 64%, respectively. In additional experiments, we tested methods using selectional restrictions imposed by the root verb (obtained from VerbNet and linked to the modifier using WordNet) or a general compound noun disambiguation system (classifying compounds according to 19 interpretations) to identify argument relations in compounds. Both methods led to an improvement of overall results, but neither improvement was significant. The third method does indicate, however, that combination with a nominalisation disambiguation system can significantly improve a more general classifier.
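The best-performing identification method can be sketched as a simple attestation check. The data structures and example entries below are our own illustration, not the authors' actual resources: the real system derives the root-verb mapping from Nomlex, Celex and Catvar, and the attested argument pairs from a parsed BNC.

```python
# Hypothetical fragment: maps a nominalisation head to its root verb
# (derived from Nomlex/Celex/Catvar in the actual system).
ROOT_VERB = {"production": "produce", "arrival": "arrive"}

# Hypothetical fragment: (verb, argument-noun) pairs attested in a
# parsed corpus such as the BNC.
ATTESTED_ARGS = {("produce", "car"), ("arrive", "train")}

def is_argument_nominalisation(modifier: str, head: str) -> bool:
    """Keep a modifier-head compound only if the modifier is attested
    as an argument of the head noun's root verb."""
    verb = ROOT_VERB.get(head)
    return verb is not None and (verb, modifier) in ATTESTED_ARGS

print(is_argument_nominalisation("car", "production"))    # True
print(is_argument_nominalisation("steel", "production"))  # False
```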
The language machine Delilah (http://220.127.116.11:8080/Delilah) computes three different semantic representations for Dutch sentences, dubbed Quasi Logical Form, Normal Logical Form and Flat Logical Form. The three semantic modes are closely related and interdependent, but differ with respect to compositionality, specificity and locality. As for compositionality, they are either the direct result of graph unification (QLF) or produced by post-derivational spell-out (NLF and FLF). As for specificity, they make explicit (NLF and FLF) or leave implicit (QLF) the dependencies between operators. As for locality, the dependencies are (FLF) or are not (NLF, QLF) compiled out at the level of individual variables and predicates. Each mode of semantic representation has its own computational advantages. QLF reflects the sentential structure and can be exploited for disambiguation and ranking of readings. NLF represents an instance of classical predicate logic that is well suited for checking semantic equivalence, e.g. in translation. FLF is modeled for on-line inference. Anaphora are interpreted on the basis of QLF and FLF.
We discuss the grammatical, logical and computational construal of each of the logical forms and their relations and interdependencies. We will demonstrate the representations and argue why specific properties of a logical form serve certain purposes.
A method is described to incorporate bilexical preferences between phrase heads, such as selection restrictions, in a wide-coverage stochastic attribute-value grammar. The bilexical preferences are modelled as association rates determined on the basis of a very large parsed corpus (about 500M words). We show that the incorporation of such self-trained preferences improves parsing accuracy significantly.
In parse selection, the task is to select the correct syntactic analysis of a given sentence from a set of parses. On the basis of correctly labelled examples, supervised parse selection techniques can be employed to obtain reasonable accuracy. Although parsing has improved enormously over the last few years, even the most successful parsers make very silly, sometimes embarrassing, mistakes. In our experiments with a large wide-coverage stochastic attribute-value grammar of Dutch, we noted that the system is sometimes insensitive to the naturalness of the various lexical combinations it has to consider. Although parsers often employ lexical features which are in principle able to represent preferences with respect to word combinations, the training data is typically too small to learn the relevance of such features successfully.
In this paper, we describe an alternative approach in which we employ pointwise mutual information association scores. The association scores used here are estimated using a very large parsed corpus of 500 million words (27 million sentences). The association scores are incorporated in the maximum entropy disambiguation model using a technique described in Johnson & Riezler (2000), who show how reference distributions can be integrated in a log-linear model. We show that the addition of lexical association scores in the disambiguation model improves overall parsing accuracy.
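The pointwise mutual information scores mentioned above compare the observed frequency of a word pair with the frequency expected if the words were independent. A minimal sketch with toy counts (the real scores are estimated from a 500-million-word parsed corpus):

```python
import math

def pmi(joint_count, count_x, count_y, total):
    """Pointwise mutual information between two words x and y:
    log2 of the ratio between the observed joint probability and
    the product of the marginal probabilities."""
    p_xy = joint_count / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Toy example: the pair occurs 25 times more often than chance predicts.
score = pmi(joint_count=50, count_x=1000, count_y=2000, total=1_000_000)
print(round(score, 2))  # 4.64
```

A positive score indicates a natural lexical combination; such scores can then serve as features in the maximum entropy disambiguation model.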
Because the association scores are estimated on the basis of a large corpus that is parsed by the very parser we aim to improve, the technique constitutes a particular instance of self-training.
Clearly, the idea that selection restrictions ought to be useful for parsing is not new. However, as far as we know, this is the first time that automatically acquired selection restrictions have been shown to improve parsing accuracy. Related research includes Abekawa & Okumura (2006) and Kawahara & Kurohashi (2006), where statistical information between verbs and case elements is collected on the basis of large automatically analysed corpora.
In an application such as question answering, extracting semantic relationships between entities is an important step toward finding correct answers. To answer a question such as 'What causes hemorrhoids?', at least two entities should be identified and linked by a causal relationship: an illness entity (e.g., 'hemorrhoids') and a causal entity (e.g., 'excessive pressure on the veins in the pelvic and rectal area'). The automatic extraction of this relationship using patterns requires a thorough analysis of a domain corpus, aimed at finding possible constructions for that particular relation. To avoid manual and time-consuming corpus analysis, we develop a method for learning relationship patterns which exploits dependency trees and semantic information. This method can be seen as a pre-processing step toward the construction of useful patterns.
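A common way to exploit dependency trees for this purpose, sketched here under our own assumptions (the abstract does not give the authors' exact representation), is to take the shortest relation path between two entities as a candidate pattern:

```python
from collections import deque

# Toy dependency analysis of "excessive pressure causes hemorrhoids"
# as (head, relation, dependent) triples.
EDGES = [
    ("causes", "nsubj", "pressure"),
    ("causes", "dobj", "hemorrhoid"),
]

def dependency_path(start, end, edges):
    """Shortest relation path between two entities in a dependency tree;
    such paths can serve as candidate relation-extraction patterns."""
    graph = {}
    for head, rel, dep in edges:
        graph.setdefault(head, []).append((rel, dep))
        graph.setdefault(dep, []).append((rel, head))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == end:
            return path
        for rel, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [rel]))
    return None

print(dependency_path("pressure", "hemorrhoid", EDGES))  # ['nsubj', 'dobj']
```

Paths collected this way can then be filtered with semantic information (e.g. requiring an illness entity at one end) before being turned into patterns.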
Delilah (http://18.104.22.168:8080/Delilah/) is an NLP system that, among other things, generates well-formed Dutch sentences using a lexicon and a categorial, unification-based grammar. The lexicon is built as a collection of detailed HPSG-style specifications for all word forms, and can be regarded as a set of entries in a database. For grammar-driven generation, efficient access to the lexicon is crucial, as a word form should only be produced when its lexical specification matches certain constraints specified by the grammar and the generation algorithm. In a relational model, a lexical specification would be economically encoded by one or more records in different tables, with named attributes and atomic values. In an HPSG-style specification, however, values are typically paths or complex terms. Although it is possible to encode complex terms into records during storage and decode them during retrieval, this is not a logical approach, as the generation algorithm operates on complete lexical specifications, which implies retrieval of complete lexical specifications. In an object-oriented model, by contrast, a lexical specification is encoded by a single object, addressable by a unique identifier. Furthermore, objects are better equipped to represent the variables typically used in HPSG-style specifications for underspecification, inheritance, or value sharing, as objects merge data and variables into a single entity. While the storage of objects is redundant, they can still be retrieved efficiently using index tables and compression techniques, which have been implemented in ISO Prolog as much as possible.
The Open Source Lexical Information Network (OSLIN) is a collection of relational lexical resources, currently under development for Portuguese. The database consists of a large-scale full-form lexicon (around 130k lemmas), over which we are building sets of explicit, manually verified (morphological) relations between lemmas. These relations currently include deverbal nouns, orthographic variants, diminutives, deadjectival adverbs, -able adjectives, prefixed verbs, and gentiles, with various others under development. The database currently contains over 30,000 lexical relations.
The database itself provides a lookup table for advanced lemmatization of known words, in which not only inflected forms are related to their lemmas (such as mice to mouse), but various types of derivational forms are related back to their root form, such as demystification to mystify. Since all relations are semantically tagged, this furthermore provides the information that demystification is the negation of the abstract event related to the verb mystify.
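Such a lookup table can be sketched as a simple mapping from derived or inflected forms to a root lemma plus a semantic tag. The entries and tag names below are invented for illustration; the real database encodes these relations in its own schema.

```python
# Hypothetical fragment of an OSLIN-style relation table: each known
# form maps to its root lemma and the semantic tag of the relation.
RELATIONS = {
    "mice": ("mouse", "plural"),
    "demystification": ("mystify", "negated-event-noun"),
}

def lemmatize(word):
    """Resolve inflected *and* derivational forms back to a root lemma,
    keeping the semantic tag; unknown words pass through unchanged."""
    return RELATIONS.get(word, (word, "lemma"))

print(lemmatize("demystification"))  # ('mystify', 'negated-event-noun')
print(lemmatize("table"))            # ('table', 'lemma')
```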
We are also attempting to exploit the explicit relations in the database for the construction of a morphological parser which takes semantic and other restrictions into account. The idea is that the complete set of relations between known, lexicalized words provides the ideal basis for predicting which derivational processes are admissible for unknown words.
This talk will outline the modular set-up of the OSLIN database and demonstrate the existing lexical resources. It will present results for the lexical relations that have been analyzed in more detail, and sketch the preliminary data obtained for the construction of the semantic morphological parser.
Existing rule-based morphological analysis/generation systems for Arabic mainly follow approaches developed for languages linguistically distant from Arabic; as a result, despite relying on long lexical lists or large databases as well as rule tables, they achieve low accuracy. The survey of Al-Sughaiyer and Al-Kharashi (2004) presents many such systems in detail. Arabic morphology, however, whose primary parts of speech are traditionally classified as verb, noun, and particle, exhibits highly regular template-pattern behaviour in all its verbs and in the majority of its nouns. This unique morphological characteristic makes it possible to build systems developed specifically for Arabic with high accuracy. This approach was applied by M. Shokrollahi-Far in the system first reported in “A Knowledge-Base on Holy Qur’an for Tagging Arabic Verbs” at the 1st International Conference on Digital Communications and Computer Applications, March 2007, Jordan. The recall of the system, almost 500 Kbytes in size, was reported as 97.7% in an evaluation against “Nobi: a Manually Tagged Corpus of Holy Qur’an”, a corpus presented by M. Shokrollahi-Far at CLIN 2007, Leuven, Belgium, January 2007. The present paper reports an enhancement of the system's knowledge-base: its 6500 verb patterns have been merged into roughly one-tenth of that number, occupying just 200 Kbytes, while keeping the previously reported precision and recall. In future work, the system will tag, in addition to verbs, the morphological information of those Arabic nouns that are not primitive but are generated, inflected, and transformed in the same way as verbs.
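The template-pattern regularity that the knowledge-base exploits can be illustrated in a few lines. This is a textbook sketch of Arabic root-and-pattern morphology in transliteration, not the system's own representation; digits 1-3 in the pattern stand for the three root consonants.

```python
def apply_pattern(root, pattern):
    """Interdigitate a triliteral root into a template pattern:
    digits in the pattern are replaced by root consonants,
    other characters are copied verbatim."""
    out = []
    for ch in pattern:
        out.append(root[int(ch) - 1] if ch.isdigit() else ch)
    return "".join(out)

root = "ktb"  # the 'writing' root k-t-b
print(apply_pattern(root, "1a2a3a"))  # kataba ('he wrote')
print(apply_pattern(root, "ma12a3"))  # maktab ('office, desk')
```

Because thousands of verb forms reduce to a small set of such templates, merging patterns can shrink the knowledge-base drastically without losing coverage.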
Improving Tokenization of Clitics in Some Statistical Processing Tools for Arabic: AlwAw Coordinating Conjunction as a Case Example
Nahed Abul-Hassan and Salwa Hamada (Ain Shams University)
In this paper, we describe a hypothesis that improves the performance of some statistical processing tools for Arabic (e.g. ASVMTools) in terms of morphological segmentation of clitics. The paper is confined to the investigation of الواو /AlwAw/ (‘and’), as it is the most common cliticized coordinating conjunction in Arabic. This work is motivated by the fact that the morphological segmentation of clitics is a key first step in syntactic disambiguation in Arabic. Our hypothesis is that it is possible to raise performance to 97.4% by a single preprocessing step applied to the input text.
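The kind of preprocessing step at issue can be sketched as follows. This is our own illustration of clitic-waw detachment, not the authors' exact rule; the lexicon check is one plausible way to avoid splitting words whose initial waw is part of the stem.

```python
def split_waw(token, lexicon):
    """Detach a word-initial clitic 'و' (wa-, 'and') when the remainder
    of the token is itself a known word; otherwise leave the token intact."""
    if token.startswith("و") and len(token) > 1 and token[1:] in lexicon:
        return ["و", token[1:]]
    return [token]

lexicon = {"كتاب"}  # 'book'
print(split_waw("وكتاب", lexicon))  # ['و', 'كتاب']  ('and a book')
print(split_waw("كتاب", lexicon))   # ['كتاب']       (no clitic to split)
```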
In many computational linguistic applications involving tagging and parsing, the lexicon is critically important to the success of the application. Ideally, all the tokens that are encountered in a text are accounted for in the lexicon. In actual practice, dealing with previously unseen text, full coverage is problematic, as it is impossible to have in advance a complete inventory of all the words that may be encountered, even for a morphologically ‘poor’ language such as English. No matter how large a lexicon is, it will at best be near-complete. The present paper addresses the question of how the coverage of words in previously unseen text may be improved. The adjectives in English are presented here as a case study.
Working on the assumption that most new words that are introduced into the language are constructed on the basis of already existing words through the application of word-formation processes, we investigated the role that different word-formation processes play, more specifically in the formation of adjectives in English. An analysis of adjectives in the BNC has led us to hypothesize that there might be a correlation between the word-formation processes on the one hand and the type frequency on the other. What we find is that, in the case of adjectives, compounding is the most productive word-formation process. Moreover, compounds are not formed by combining bases at will; rather, a limited set of fairly simple rules applies, restricting the co-occurrence of bases. This makes it feasible to develop an approach for handling compound adjectives more adequately than is currently done in various approaches to NLP.
In the field of computational psycholinguistics, formal models are developed to implement cognitive theories and account for psycholinguistic phenomena. Two such phenomena are lexical semantic priming and the surprisingly high speed of human sentence processing. I present a model that shows these two (apparently unrelated) phenomena to be two sides of the same coin.
The speed of sentence processing has been explained by people's ability to accurately predict upcoming words, allowing the cognitive system to prepare for these words and thereby process them faster. Lexical semantic priming has been explained by a representational scheme in which each word corresponds to a vector in high-dimensional semantic space. A stronger semantic relation between two words is reflected in their vectors being closer together in semantic space. The sequential presentation of two strongly related words thereby leads to a short distance travelled through this space, corresponding to a fast response time (i.e., priming).
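The semantic-space account sketched above reduces to a distance computation over word vectors. A toy illustration with invented two-dimensional vectors (real semantic spaces are high-dimensional and learned from data):

```python
import math

# Invented toy vectors: related words lie closer together in the space,
# so the distance travelled (and hence the response time) is smaller.
VEC = {
    "doctor": (0.9, 0.1),
    "nurse":  (0.8, 0.2),
    "carrot": (0.1, 0.9),
}

def distance(w1, w2):
    """Euclidean distance between two word vectors."""
    return math.dist(VEC[w1], VEC[w2])

# A related prime ('doctor' -> 'nurse') yields a shorter path than an
# unrelated one ('doctor' -> 'carrot'), modelling the priming effect.
print(distance("doctor", "nurse") < distance("doctor", "carrot"))  # True
```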
The model I present consists of a recurrent neural network that processes sentences one word at a time. Its inputs are vectors representing words. These vectors, which are initially random, are adapted so as to minimize the fluctuation of activation in the recurrent network. This allows for faster processing and the implicit prediction of upcoming words.
As it turns out, the resulting word representations implement the semantic-space theory of lexical semantic priming. This finding suggests that the need for fast sentence processing is responsible for the mental lexicon being organized in a way that causes semantic priming.