Lexicon


The present implementation of the parser assumes that a lexicon is available that provides all relevant information. Previous experience had taught us not to expect a publicly available, ready-made lexicon that would suit our needs. We therefore decided to compile a lexicon ourselves, designed specifically to operate in tandem with our rule-based, wide-coverage parser.

Our concerns in constructing the lexicon were the following:

- to provide adequate and reliable information

- to avoid ambiguity arising from the lexical level where this is not carried by the syntactic structure

- to achieve sufficient coverage

In building the lexicon, two options were open to us: we could aim for a word form lexicon, or we could develop a stem lexicon, in which case we would also need a morphological analyzer. We chose the first option because it allowed us to draw on a variety of resources, including text corpora and word frequency lists. Moreover, it had been shown that, with the AGFL formalism that we employ, lexicon look-up is much faster than lexico-morphological analysis.
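The difference between the two options can be illustrated with a minimal Python sketch. It is not the AGFL implementation: the miniature lexicons, the tag names and the suffix rules are all hypothetical and serve only to show that a word form lexicon answers with a single look-up, whereas a stem lexicon needs an additional analysis step for every token.

```python
# Hypothetical miniature lexicons; entries and tags are illustrative only.
WORD_FORM_LEXICON = {
    "walk": [("V", "base"), ("N", "sing")],
    "walks": [("V", "3sg"), ("N", "plur")],
    "walked": [("V", "past"), ("V", "pastpart")],
}

STEM_LEXICON = {"walk": ["V", "N"]}

# Toy suffix rules: (suffix, word class, feature). Assumed for illustration.
SUFFIX_RULES = [
    ("ed", "V", "past"),
    ("ed", "V", "pastpart"),
    ("s", "V", "3sg"),
    ("s", "N", "plur"),
    ("", "V", "base"),
    ("", "N", "sing"),
]


def lookup_word_form(form):
    """Option 1: one dictionary look-up per token."""
    return list(WORD_FORM_LEXICON.get(form, []))


def analyse_with_stems(form):
    """Option 2: strip candidate suffixes, then consult the stem lexicon."""
    readings = []
    for suffix, word_class, feature in SUFFIX_RULES:
        if suffix and not form.endswith(suffix):
            continue
        stem = form[: len(form) - len(suffix)] if suffix else form
        if word_class in STEM_LEXICON.get(stem, []):
            readings.append((word_class, feature))
    return readings
```

In this toy setting both functions return the same readings for "walks", but the second has to try every suffix rule first, which hints at why direct look-up can be cheaper at parse time.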

The specification of the word classes and associated features is given in the grammar underlying the parser. The major word classes that are distinguished are essentially the same as those usually found in an EAGLES-conformant set. However, provisions have been made so that specific words or sets of words that play a pivotal role in the syntactic structure, or that behave differently from other members of the same word class, can be addressed separately. Examples are cleft it, provisional it, formal it, existential there, the proform so, the adjectives such and worth, negative adverbs (never, not), nominal adjectives, and phrasal adverbs.

In the case of subordinating conjunctions, we have found it crucial to distinguish them according to the functional role of the subordinate clause they can introduce (adverbial, direct object, noun phrase postmodifier, postmodifier in an adjective phrase, etc.). And while we recognize prepositions as a separate word class, we also make it possible to address the individual members of this class separately. Semantic subclasses are also distinguished to some extent, as in the case of nouns, where there is a distinction between common and proper nouns, but also between nouns expressing, for example, time or place.
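As a rough illustration of this organization, the following Python sketch models lexicon entries as lists of readings, each with a word class and a set of features. The class labels (such as `CLEFT_IT` or `EXIST_THERE`), the feature names and the sample entries are hypothetical; they are not the notation of the grammar itself and only show how special items and functional roles might be made individually addressable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One lexical reading: a word class plus optional features."""
    word_class: str
    features: frozenset = frozenset()


# Hypothetical entries; labels and features are assumptions for illustration.
LEXICON = {
    "it": [
        Reading("PRON", frozenset({"pers"})),
        Reading("CLEFT_IT"),
        Reading("PROV_IT"),
        Reading("FORMAL_IT"),
    ],
    "there": [Reading("ADV"), Reading("EXIST_THERE")],
    "never": [Reading("ADV", frozenset({"neg"}))],
    "because": [Reading("CONJ", frozenset({"subord", "adverbial"}))],
    "that": [Reading("CONJ", frozenset({"subord", "direct_object", "np_postmod"}))],
    "whether": [Reading("CONJ", frozenset({"subord", "direct_object"}))],
    "week": [Reading("N", frozenset({"common", "time"}))],
}


def readings(form, word_class=None):
    """Return all readings of a word form, optionally restricted to one class."""
    found = LEXICON.get(form.lower(), [])
    if word_class is None:
        return list(found)
    return [r for r in found if r.word_class == word_class]


def conjunctions_for_role(role):
    """List conjunctions that can introduce a clause in the given role."""
    return sorted(
        form
        for form, entry in LEXICON.items()
        if any(r.word_class == "CONJ" and role in r.features for r in entry)
    )
```

With entries of this kind, a grammar rule for cleft constructions could request only the `CLEFT_IT` reading of "it", and a rule for object clauses only those conjunctions carrying the `direct_object` role.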

As regards the procedure for acquiring content for the lexicon, it became clear that different strategies were needed for different word classes. For the closed word classes, a highly restrictive approach was taken in which the content was based primarily on the information found in grammatical handbooks such as A Comprehensive Grammar of English (Quirk et al. 1985); all candidate items were manually validated. For the open classes, we compiled joint lists in which word types from various available sources were included, together with whatever associated information was available and considered useful (such as POS, frequency, and distribution over the number of texts). Sources included word form lexicons (such as Moby words), corpora (like the BNC and Reuters, but also smaller corpora like the Brown Corpus, the LOB Corpus and ICE-GB), concordances and word frequency lists. Frequency information was used in an attempt to exclude apparent spelling errors and other obvious flaws (including tokenization errors and POS errors). Items that make up named entities, i.e. person and geographical names, addresses, dates, numbers, measurements, etc., were excluded from the lexicon and are dealt with by a separate module.
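The compilation of joint lists for the open classes can be sketched as follows. This is only an illustration under stated assumptions: the source names, the record layout and both thresholds (`min_freq`, `min_sources`) are hypothetical, and the actual selection criteria used for the lexicon are not specified here.

```python
from collections import defaultdict


def merge_sources(sources):
    """Merge (word_form, pos, frequency) records from several sources.

    `sources` maps a source name to an iterable of records; `pos` may be
    None when a source (e.g. a plain word list) carries no POS information.
    """
    merged = defaultdict(lambda: {"pos": set(), "freq": 0, "sources": set()})
    for name, entries in sources.items():
        for form, pos, freq in entries:
            record = merged[form]
            if pos:
                record["pos"].add(pos)
            record["freq"] += freq
            record["sources"].add(name)
    return dict(merged)


def filter_candidates(merged, min_freq=2, min_sources=2):
    """Keep forms that are frequent enough or attested in several sources.

    Both thresholds are assumed values, chosen only for illustration.
    """
    return {
        form: record
        for form, record in merged.items()
        if record["freq"] >= min_freq or len(record["sources"]) >= min_sources
    }
```

A form such as a misspelling that occurs once in a single corpus would be dropped by this filter, while a form attested in several sources would be kept as a candidate for further validation.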