The CGN Lexicon

	1 March 2004

	Richard Piepenbrock
	Mila Groot
	Raffaela Vlot
	Maarten Jansonius

General information

The CGN Lexicon, as it is available in Version 1.0 of the Spoken Dutch Corpus, comprises almost all types (unique word forms) that occur in the corpus. The lexicon only includes words that occur in the corpus and excludes those types for which extensive lexical information is irrelevant. The latter applies to non-words, incomplete words, foreign words, punctuation marks, and inaudible or incomprehensible words.

This lexicon only comprises continuous word forms; multi-word expressions that include blanks can be found in a separate multi-word lexicon. The file name of the multi-word lexicon is cgnmlex.txt. In this (single word) lexicon, all parts of the multi-word expressions have been included as separate entries.

The CGN Lexicon comprises 14 columns, the first 4 of which (Id-Nummer Woordvorm, Orthografie Woordvorm, Woordsoort and Lemma) always contain information. The column Gebruik contains only codes for regional and stylistic variants. The columns Syntax, Uitspraak (4 subcolumns), Morfologie and Definitie contain codes that originate from one or more of the source lexicons (CELEX (Centrum voor Lexicale Informatie) ¹ and RBN (Referentiebestand Nederlands) ²), or that have been generated on the basis of the pronunciations in CELEX and FONILEX (Fonetisch Lexicon Vlaams) ³.

Format and contents of the CGN Lexicon

The lexicon is made available in two formats:

A standard text file (flat ASCII) with the name cgnlex.txt. The backslash ('\') is used as separator. Letters with diacritics are represented in SGML format. This file can be opened by means of a simple text editor, or a database system such as Access, ORACLE or dBase.
An XML file with the name cgnlex.lex. This file can be opened in any XML browser or editor and then searched for certain values. The associated DTD (Document Type Definition) lex.dtd is also available.

The lexicon files are ordered according to the orthographic word form (Orthografie Woordvorm), part-of-speech (Woordsoort) and then by Lemma.
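The backslash-separated text format described above can be read with ordinary string handling. The following is a minimal sketch, not part of the CGN distribution: the function name split_record is hypothetical, and the handling of the trailing backslash is an assumption based on the example records shown later in this document.

```python
def split_record(line: str) -> list[str]:
    """Split one line of the flat text lexicon into its fields.

    Assumption: every record ends with a trailing backslash, as in the
    example records in this document, so one empty final field is dropped.
    """
    fields = line.rstrip("\n").split("\\")
    if fields and fields[-1] == "":
        fields.pop()
    return fields


# Example record taken from the section on the Status field.
record = "bakkie\\N(soort,ev,dim,onz,stan)\\bakkie\\INF\\"
print(split_record(record))
```

Empty fields in the middle of a record are kept as empty strings, so column positions stay aligned.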

Number of word form entries (type - part-of-speech pairs)	181,579
Total number of entries, incl. syntactic patterns	229,104
Number of fields	14

Contents of the lexicon fields

CGN_LEXICON.Id-Nummer Woordvorm ::= [0-9]+

CGN_LEXICON.Orthografie Woordvorm ::= ([0-9][A-Z][a-z][&'-;])+

"&" + capital/small letter + diacritic representation + ";"

Specifically:

"&" +	"a" +	"grave"	+ ";"
	"c"	"acute" (= aigu)
	"e"	"circ" (= circonflexe)
	"i"	"uml" (= trema)
	"n"	"cedil" (= cedille)
	"o"	"tilde"
	"u"	"ring" (only in the names 'Åkermans' and 'Ålesund')
	"A"
	"C"
	"E"
	"I"
	"N"
	"O"
	"U"

e.g.	'inconveni&euml;ren' for 'inconveniëren'
	and
	'Fran&ccedil;aise' for 'Française'

The SGML symbol '&amp;' is used to represent the ampersand ('&').
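When the text file is processed outside an SGML-aware tool, the entity notation has to be turned back into ordinary characters. A minimal sketch, assuming the entity names in the lexicon coincide with the standard HTML character entities (which holds for the grave, acute, circ, uml, cedil, tilde and ring combinations listed above); the function name decode_orthography is hypothetical.

```python
import html


def decode_orthography(word: str) -> str:
    """Replace character entities such as &euml; by the Unicode character."""
    # html.unescape knows the standard named entities, including &amp;.
    return html.unescape(word)


print(decode_orthography("inconveni&euml;ren"))
print(decode_orthography("Fran&ccedil;aise"))
```

Note that html.unescape also decodes numeric character references; whether those occur in the lexicon is not stated here.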

CGN_LEXICON.Woordsoort ::=

"ADJ(" value ("," value)* ")" |

"BW(" ("dial"|"") ")" |

"LID(" value ("," value)* ")" |

"N(" value ("," value)* ")" |

"SPEC(afgebr)" |

"SPEC(deeleigen)" |

"SPEC(meta)" |

"SPEC(onverst)" |

"SPEC(vreemd)" |

"TSW(" ("dial"|"") ")" |

"TW(" value ("," value)* ")" |

"VG(" value ")" |

"VNW(" value ("," value)* ")" |

"VZ(" value ("," value)* ")" |

"WW(" value ("," value)* ")"

Part of Speech Tagging en Lemmatisering

ADJ

adjectief (= adjective)

BW

bijwoord (= adverb)

LID

lidwoord (= article)

N

substantief (= noun)

SPEC(afgebr)

code that is used almost exclusively in the lexicon for parts of contracted multi-word expressions (e.g. 'in- en uitvoer'); in the corpus it has also been used for all incomplete words

SPEC(deeleigen)

code for part of a multi-word proper name

SPEC(meta)

code for a word mention

SPEC(onverst)

code for an incomprehensible utterance

SPEC(vreemd)

code for an utterance in a foreign language or a foreign word

TSW

tussenwerpsel (= interjection)

TW

telwoord (= numeral)

VG

voegwoord (= conjunction)

VNW

voornaamwoord (= pronoun)

VZ

voorzetsel (= preposition)

WW

werkwoord (= verb)
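A Woordsoort value as defined by the grammar above consists of a category code followed by a parenthesised, comma-separated list of feature values, which may be empty. The following is a minimal sketch of splitting such a tag; the names TAG_RE and parse_tag are hypothetical, and the sketch does not check that the values are legal for the category.

```python
import re

# Category code in capitals, then the feature values between parentheses.
TAG_RE = re.compile(r"^([A-Z]+)\(([^()]*)\)$")


def parse_tag(tag: str) -> tuple[str, list[str]]:
    """Split a tag such as 'WW(pv,tgw,mv)' into ('WW', ['pv', 'tgw', 'mv'])."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a Woordsoort tag: {tag!r}")
    head, body = match.groups()
    # An empty body, as in 'BW()', yields an empty feature list.
    values = body.split(",") if body else []
    return head, values


print(parse_tag("N(soort,ev,dim,onz,stan)"))
print(parse_tag("SPEC(afgebr)"))
```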

CGN_LEXICON.Lemma ::= ([0-9][A-Z][a-z][&'-;_])+

CGN_LEXICON.Id-Nummer Lemma ::= [0-9]+

Definitie

CGN_LEXICON.Syntax

CGN Syntactische Annotatie

CGN_LEXICON.Status ::= ("B" | "INF" | "*d" | "*u" | "*v" | "*x" | "*z")("," Status)* | NULL

B = Belgian
INF = informal
*d = dialect
*u = mispronunciation
*v = foreign word
*x = incomprehensible word
*z = word pronounced with a heavy regional accent, transcribed using normalised spelling

'B' is a code that originates from the RBN lexicon and is used for words that can be regarded as 'characteristic for the Flemish vocabulary'. These can be words that are commonly used only in Flanders (e.g. 'frigo' and 'jobstudent') as well as words that are common in Standard Dutch but have a different meaning in Flanders than in the rest of the Dutch-speaking area (e.g. 'aardig' (vreemd, 'strange') and 'afschrijven' (spieken, 'to copy from someone else')).
'INF' is used for words from printed resources (such as Van Dale) which, according to the CGN protocols or the judgment of the CGN lexicon staff, must be considered part of the general vocabulary but are informal, idiosyncratic or regionally marked. In the current version the diminutive forms ending in '-ie(s)' (Northern Standard Dutch) and '-ke(n)(s)' (Southern Standard Dutch) are not considered dialectal; instead they receive the code 'INF':

bakkie\N(soort,ev,dim,onz,stan)\bakkie\INF\
beessie\N(soort,ev,dim,onz,stan)\beest\INF\

'*d' is used for words that transcribers and lexicographers consider to be dialect. For example:

benne\WW(pv,tgw,mv)\zijn\*d\
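According to the grammar above, the Status field holds either nothing or a comma-separated list of codes. A minimal sketch of expanding such a field into readable labels follows; the names STATUS_CODES and describe_status are hypothetical, and it is an assumption that NULL appears as an empty string in the text file.

```python
# Code meanings copied from the list above.
STATUS_CODES = {
    "B": "Belgian",
    "INF": "informal",
    "*d": "dialect",
    "*u": "mispronunciation",
    "*v": "foreign word",
    "*x": "incomprehensible word",
    "*z": "heavy regional accent, normalised spelling",
}


def describe_status(field: str) -> list[str]:
    """Expand a Status field such as 'B,INF' into a list of descriptions."""
    if not field:  # assumption: NULL is stored as an empty field
        return []
    return [STATUS_CODES[code] for code in field.split(",")]


print(describe_status("INF"))
print(describe_status("*d"))
```

An unknown code raises a KeyError, which makes unexpected values visible instead of silently passing them through.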

CGN_LEXICON.Uitspraak CGN Nederlands Normaal ::= [+2:@AEGIJNOSYZabdefghijklmnoprstuvwxyz~]* ⁴

CGN_LEXICON.Uitspraak CGN Vlaams Normaal ::= [*+2:@AEGIJNOSYZabdefghijklmnoprstuvwxyz~]*

CGN_LEXICON.Uitspraak CGN Vlaams Formeel ::= [+2:@AEGINOSYZ`abdefghijklmnoprstuvwxyz]*

CGN_LEXICON.Uitspraak CELEX ::= ['+-2:@AEGIJNOSYZabdefghijklmnoprstuvwxyz~]*

The representation is in the CGN phoneme set, incl. the palatal nasal /J/.
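The character classes above can be used to check pronunciation fields after parsing. The following is a minimal sketch for the CELEX column only; the names CELEX_RE and is_valid_celex are hypothetical. It assumes that the hyphen in the class is meant as a literal character (the syllable separator seen in the example pronunciations) and not as a range operator, so it is escaped here.

```python
import re

# Character class copied from the CELEX definition, with the hyphen escaped.
CELEX_RE = re.compile(r"['+\-2:@AEGIJNOSYZabdefghijklmnoprstuvwxyz~]*")


def is_valid_celex(pronunciation: str) -> bool:
    """Return True if the string uses only characters allowed in the field."""
    return CELEX_RE.fullmatch(pronunciation) is not None


print(is_valid_celex("dor-'lo-p@"))
print(is_valid_celex("dor lo"))
```

The empty string is accepted, in line with the '*' in the definition.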

CGN_LEXICON.Morfologie

Overview of part-of-speech codes:

N = substantief (= noun)
A = adjectief (= adjective)
Q = telwoord (= numeral)
V = werkwoord (= verb)
D = lidwoord (= article)
O = voornaamwoord (= pronoun)
B = bijwoord (= adverb)
P = voorzetsel (= preposition)
C = voegwoord (= conjunction)
I = tussenwerpsel (= interjection)
X = restcategorie (= rest category)
. = affix (= affix)
x = deel van discontinu affix (= part of discontinuous affix)

arrogantie:

((arrogant)[A],(ie)[N|A.])[N]

boek:

(boek)[N] (i.e. monomorphemic)

telraam:

((tel)[V],(raam)[N])[N]

hondenhok:

((hond)[N],(en)[N|N.N],(hok)[N])[N]

onmondig:

((on)[A|.A],((mond)[N],(ig)[A|N.])[A])[A]

gehemelte:

((ge)[N|.Nx],(hemel)[N],(te)[N|xN.])[N]

arbeidsovereenkomst:

((arbeid)[N],(s)[N|N.N],(((overeen)[B],(kom)[V])[V],(st)[N|V.])[N])[N]
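In the examples above every constituent is written as (content)[label], where the content is either a stem or affix, or a comma-separated list of nested constituents. A minimal recursive sketch of reading this notation into a tree follows; the function names parse_node and leaves are hypothetical, and the sketch assumes well-formed input in which stems contain no parentheses.

```python
def parse_node(text, pos=0):
    """Parse one '(content)[label]' unit starting at text[pos].

    Returns ((content, label), next_pos); content is a stem string or a
    list of child nodes of the same shape.
    """
    if text[pos] != "(":
        raise ValueError(f"expected '(' at position {pos}")
    pos += 1
    if text[pos] == "(":
        # Complex constituent: comma-separated child nodes.
        content = []
        while True:
            child, pos = parse_node(text, pos)
            content.append(child)
            if text[pos] != ",":
                break
            pos += 1
    else:
        # Simple constituent: a bare stem or affix.
        end = text.index(")", pos)
        content = text[pos:end]
        pos = end
    if text[pos] != ")" or text[pos + 1] != "[":
        raise ValueError(f"expected ')[' at position {pos}")
    label_end = text.index("]", pos)
    label = text[pos + 2:label_end]
    return (content, label), label_end + 1


def leaves(node):
    """Return the stems and affixes of a parsed node from left to right."""
    content, _label = node
    if isinstance(content, str):
        return [content]
    result = []
    for child in content:
        result.extend(leaves(child))
    return result


tree, _ = parse_node("((hond)[N],(en)[N|N.N],(hok)[N])[N]")
print(tree)
print(leaves(tree))
```

Keeping the label as an unparsed string leaves the affix patterns such as N|N.N intact for later interpretation.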

CGN_LEXICON.Corpus Status ::= ( "C" | "I" | "O" | "V" ) | NULL

C = correct spelling of a corpus type
I = incorrect spelling of a corpus type
O = non-validated spelling of a corpus type
V = validated spelling of a corpus type

396259\asielaanvragen\N(soort,mv,basis)\asielaanvrage\133817\C\
392625\asielaanvragen\N(soort,mv,basis)\asielaanvraag\131545\V\

CGN_LEXICON.Definitie

73704\doorlopen\WW(inf,vrij,zonder)\doorlopen\23802\dor-'lo-p@\V\bewegen door, tot het einde volgen\
73705\doorlopen\WW(inf,vrij,zonder)\doorlopen\501446\'dor-lo-p@\V\verder lopen, vermengen van kleuren\

¹ Centrum voor Lexicale Informatie. Interfacultaire Werkgroep Taal en Spraak, Universiteit van Nijmegen & Max Planck Instituut voor Psycholinguïstiek, Nijmegen.

² Referentiebestand Nederlands. Vakgroep Lexicologie, Vrije Universiteit Amsterdam & Instituut voor Nederlandse Lexicologie, Leiden & Departement Linguïstiek, Katholieke Universiteit Leuven & Vakgroep Nederlands, Universiteit Utrecht.

³ FONILEX. Centre for Computational Linguistics, Katholieke Universiteit Leuven & Centrum voor Nederlandse Taal en Spraak, Universiteit Antwerpen & Vakgroep voor Electronica en Informatiesystemen, Universiteit Gent

⁴ CGN grapheme-to-phoneme converted. See:
Véronique Hoste, Steven Gillis & Walter Daelemans (Universiteit Antwerpen), A Rule Induction Approach to Modeling Regional Pronunciation Variation. In: Proceedings of COLING 2000, Saarbrücken, Germany. San Francisco: Morgan Kaufmann Publishers, 2000, pp. 327-333.
and:
Véronique Hoste, Steven Gillis & Walter Daelemans, Machine Learning for Modeling Dutch Pronunciation Variation. Proceedings of the tenth CLIN meeting, Utrecht, The Netherlands.