The CGN lexicon

Version 1.0 of the Spoken Dutch Corpus also includes a version of the CGN lexicon. This lexicon was used in the production of the corpus, for example for spell checking, lemmatisation of the tokens and assignment of the part of speech. Lexical information was also used in the production of the (automatic) phonetic transcriptions and the syntactic annotations.

The CGN lexicon is based on existing resources such as CELEX, RBN, PAROLE, FONILEX, Van Dale, the Woordenlijst Nederlandse Taal (Groene Boekje) and the Corpus Uit den Boogaart, and has been further adapted and extended for use with the Spoken Dutch Corpus. The lexicon comprises two parts: a standard lexicon with continuous word forms and a separate lexicon for multiwords. Both lexicons include only those lexical items that actually occur in the corpus. Items for which lexical information is irrelevant (e.g. hesitations and incomplete words) are excluded. The lexicon files are available in two formats: flat ASCII (text) format, and HTML for consultation by means of a web browser. The files can be found on the annotation DVD in the directories /data/lexicon/text/ and /data/lexicon/html/. More detailed information about the standard lexicon and the multiword lexicon can be found in README_cgnlex7.htm and README_cgnmlex.htm respectively.

The lexicons contain information about the token (word form), part of speech, lemma, syntax, orthographic status, pronunciation, morphology and, for multiword expressions, their nature (continuous/discontinuous).
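
As a rough illustration of how an entry from the flat text files might be read into a record with the information types listed above, consider the following minimal Python sketch. The field order, the field names, the tab separator and the sample line are all assumptions made for this example; the actual layout is described in the README files mentioned earlier and should be checked there.

```python
from dataclasses import dataclass

# Hypothetical field names, taken from the information types listed in the
# text. Their order in the real files is NOT confirmed here.
FIELDS = (
    "token",
    "pos",
    "lemma",
    "syntax",
    "orthographic_status",
    "pronunciation",
    "morphology",
    "multiword_nature",
)


@dataclass
class LexiconEntry:
    token: str
    pos: str
    lemma: str
    syntax: str
    orthographic_status: str
    pronunciation: str
    morphology: str
    multiword_nature: str


def parse_line(line: str, sep: str = "\t") -> LexiconEntry:
    """Split one lexicon line into a LexiconEntry.

    `sep` is an assumed separator; pass the one the real files use.
    Missing trailing fields are filled with empty strings, and any
    extra fields beyond those in FIELDS are ignored.
    """
    parts = line.rstrip("\r\n").split(sep)
    parts += [""] * (len(FIELDS) - len(parts))
    return LexiconEntry(*parts[: len(FIELDS)])


# Illustrative line only, not copied from the lexicon files.
entry = parse_line("huizen\tN(soort,mv,basis)\thuis")
print(entry.token, entry.pos, entry.lemma)
```

Making the separator a parameter keeps the sketch usable once the real delimiter is known, and padding short lines avoids failures on entries where some information types are absent.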