PELICAN - Parser for English Linguistics using Immediate Constituent Analysis

Introduction

PELICAN is a wide-coverage, rule-based parser for English. It originates from research in corpus linguistics, where efforts were directed at developing tools for the linguistic annotation of corpora. The aim was to develop a parser that yields linguistically motivated, deep analyses of use to linguists working in English descriptive linguistics.

Parser input

The parser takes raw text as input. The current version assumes that, before parsing, the text has been split into sentences or similar units. Initial capitalization must be removed wherever the common lexicon entry would list the word with an initial lowercase letter. Lexical items such as names, which are spelled with an initial capital letter, retain that capital. Where words or names are fully capitalized, as for example in headlines, all letters but the first are to be converted to lowercase. Acronyms are an exception: all their capital letters are retained.
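The capitalization rules above can be sketched as a small preprocessing step. This is only an illustration and not part of PELICAN: the function names, the `common_words` set (standing in for the lowercase entries of the common lexicon) and the `acronyms` set are all assumptions made for the example.

```python
def normalize_token(token, common_words, acronyms):
    """Apply the capitalization rules to one token (hypothetical helper).

    common_words: lowercase forms assumed to be listed in the common lexicon.
    acronyms: tokens whose capital letters must all be retained.
    """
    # Acronyms are the exception: keep every capital letter.
    if token in acronyms:
        return token
    # Fully capitalized words (e.g. in headlines): lowercase all but the first letter.
    if len(token) > 1 and token.isupper():
        token = token[0] + token[1:].lower()
    # Remove the initial capital when the lexicon lists the word in lowercase.
    if token[:1].isupper() and token.lower() in common_words:
        return token.lower()
    # Names and anything else keep their spelling.
    return token


def normalize_sentence(tokens, common_words, acronyms):
    """Normalize every token of an already tokenized sentence."""
    return [normalize_token(t, common_words, acronyms) for t in tokens]


# Toy data, invented for the example.
common_words = {"the", "parliament", "votes", "on", "budget", "in"}
acronyms = {"NATO"}

sentence = normalize_sentence(
    ["The", "NATO", "summit", "opened", "in", "London"], common_words, acronyms
)
headline = normalize_sentence(
    ["PARLIAMENT", "VOTES", "ON", "BUDGET", "IN", "LONDON"], common_words, acronyms
)
```

With this toy lexicon, "The" becomes "the", "NATO" and "London" are left alone, and the headline is reduced to lowercase words apart from the name "London". A real setup would need a full lexicon and a way to recognize acronyms that are not listed in advance.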

Parser output

The parser has been designed to yield at least the parse that is appropriate for a sentence in its given context. However, since the parser makes no use of semantic or pragmatic information, it may also yield alternative parses that would be appropriate in a different context.

The parser can be used with different settings. Instead of yielding all possible analyses, it can be set to yield no more than a specified maximum number of analyses. In that case the contextually appropriate analysis is no longer guaranteed to be included in the restricted set.
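The effect of such a limit can be illustrated with a minimal sketch. It assumes, purely for the example, that analyses arrive one by one from an iterator; `limit_analyses` and `max_analyses` are hypothetical names and do not reflect the parser's actual options.

```python
from itertools import islice


def limit_analyses(analyses, max_analyses=None):
    """Return all analyses, or at most max_analyses of them (hypothetical helper)."""
    if max_analyses is None:
        # No limit set: keep every analysis.
        return list(analyses)
    # Stop after the first max_analyses items; later analyses are never seen.
    return list(islice(analyses, max_analyses))


# Placeholder labels standing in for parse trees.
all_parses = limit_analyses(iter(["parse-1", "parse-2", "parse-3"]))
capped = limit_analyses(iter(["parse-1", "parse-2", "parse-3"]), max_analyses=2)
```

If the contextually appropriate analysis happened to be "parse-3", the capped run would miss it, which is the trade-off described above.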

Representations

Parse trees take the form of labelled bracketings. Alternatively, an indent format can be produced. The parser's immediate output consists of derivations in which every rewriting step that leads to a particular analysis can be traced. For most practical applications this output is highly redundant. A filter is therefore applied that removes any information relevant only to grammar development. The filtered output is presented in the indent format. Analyses in indent format can be imported into the MOOSE tree editor, where they can be viewed as trees and (post-)edited.
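The relation between the two representations can be sketched as follows. The exact PELICAN notation is not reproduced here: the example assumes a generic bracketing of the form `(LABEL child ...)` with invented labels, well-formed input, and two spaces of indentation per level. The functions `parse_bracketing` and `to_indent` are hypothetical.

```python
def parse_bracketing(text):
    """Read a bracketing such as "(S (NP ...) (VP ...))" into nested tuples.

    Each node becomes (label, children); words are plain strings.
    Assumes well-formed input in a generic notation, not PELICAN's own.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # skip the closing bracket
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the tree")
    return tree


def to_indent(node, depth=0, step=2):
    """Return the tree as a list of lines, one node per line, indented by depth."""
    if isinstance(node, str):
        return [" " * (depth * step) + node]
    label, children = node
    lines = [" " * (depth * step) + label]
    for child in children:
        lines.extend(to_indent(child, depth + 1, step))
    return lines


tree = parse_bracketing("(S (NP (DET the) (N parser)) (VP (V runs)))")
indent_lines = to_indent(tree)
```

Joining `indent_lines` with newlines gives one node per line, with each level of embedding indented one step further than its parent.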
