The CGN Lexicon

  1 March 2004 
   
  Richard Piepenbrock 
  Mila Groot 
  Raffaela Vlot 
  Maarten Jansonius 

General information

The CGN Lexicon, as it is available in Version 1.0 of the Spoken Dutch Corpus, comprises almost all types (unique word forms) that occur in the corpus. The lexicon only includes words that occur in the corpus and excludes those types for which extensive lexical information is irrelevant. The latter applies to non-words, incomplete words, foreign words, punctuation marks, and inaudible or incomprehensible words.

This lexicon only comprises continuous word forms; multi-word expressions that include blanks can be found in a separate multi-word lexicon. The file name of the multi-word lexicon is cgnmlex.txt. In this (single word) lexicon, all parts of the multi-word expressions have been included as separate entries.

The CGN Lexicon comprises 14 columns, the first 4 of which (Id-Nummer Woordvorm, Orthografie Woordvorm, Woordsoort and Lemma) always contain information. The column Gebruik contains only codes for regional and stylistic variants. The columns Syntax, Uitspraak (4 subcolumns), Morfologie and Definitie contain codes that originate from one or both of the source lexicons, CELEX (Centrum voor Lexicale Informatie) 1 and RBN (Referentiebestand Nederlands) 2, or that have been generated on the basis of the pronunciations in CELEX and FONILEX (Fonetisch Lexicon Vlaams) 3.
 

Format and contents of the CGN Lexicon

The lexicon is made available in two formats:
  1. A standard text file (flat ASCII) with the name cgnlex.txt. The backslash ('\') is used as separator. Letters with diacritics are represented in SGML format. This file can be opened by means of a simple text editor, or a database system such as Access, ORACLE or dBase.
  2. An XML file with the name cgnlex.lex. This file can be opened by any XML browser or editor, and then searched for certain values. The associated DTD (Document Type Definition) lex.dtd is also available.
The lexicon files are ordered according to the orthographic word form (Orthografie Woordvorm), part-of-speech (Woordsoort) and then by Lemma.
Number of word form entries (type - part-of-speech pairs): 181,579
Total number of entries, incl. syntactic patterns: 229,104
Number of fields: 14
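
As an illustration of the flat text format, the following minimal Python sketch splits one backslash-separated line into named fields. The field order is an assumption: it simply follows the order in which the fields are described in this document. The function name parse_record and the sample values are made up for illustration and are not taken from cgnlex.txt.

```python
# Field names in the order they are described in this document.
# ASSUMPTION: the physical column order in cgnlex.txt matches this order.
FIELDS = [
    "Id-Nummer Woordvorm", "Orthografie Woordvorm", "Woordsoort", "Lemma",
    "Id-Nummer Lemma", "Syntax", "Status",
    "Uitspraak CGN Nederlands Normaal", "Uitspraak CGN Vlaams Normaal",
    "Uitspraak CGN Vlaams Formeel", "Uitspraak CELEX",
    "Morfologie", "Corpus Status", "Definitie",
]


def parse_record(line):
    """Split one backslash-separated line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\\")
    # Pad short lines with empty strings; zip() drops any surplus value
    # produced by a trailing separator.
    values += [""] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


# Made-up sample values (not copied from the lexicon), joined with the
# backslash separator and a trailing backslash.
sample_values = [
    "1", "boek", "N(soort,ev,basis,onz,stan)", "boek", "1", "", "",
    "buk", "buk", "buk", "'buk", "(boek)[N]", "V", "",
]
sample_line = "\\".join(sample_values) + "\\"
record = parse_record(sample_line)
```

Empty fields come back as empty strings, which corresponds to the NULL alternative in the field grammars.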

Contents of the lexicon fields

  1. CGN_LEXICON.Id-Nummer Woordvorm ::= [0-9]+

  2. Unique rank number (Id = 'identification') for each word form - tag pair. It is not unique per line because more than one syntactic complementation pattern may occur for each type-tag combination. Orthographically identical word forms can also occur more than once when they belong to different lemmas, or when they can have different morpho-syntactic codes within a single lemma (e.g. 'vatten' as infinitive, present tense plural, and past tense plural of the verb 'vatten').
  3. CGN_LEXICON.Orthografie Woordvorm ::= ([0-9][A-Z][a-z][&'-;])+

  4. Orthographic representation of the word form, i.e. the inflection paradigm of the lemma, to the extent that the inflections occur in the corpus. Diacritics are represented in SGML format as follows:

    "&" + capital/small letter + diacritic representation + ";"

    Specifically, the letter is one of:
      a  c  e  i  n  o  u  A  C  E  I  N  O  U
    and the diacritic representation is one of:
      grave
      acute (= aigu)
      circ (= circumflex)
      uml (= diaeresis)
      cedil (= cedilla)
      tilde
      ring (only in the names 'Åkermans' and 'Ålesund')
    e.g. 'inconveni&euml;ren' for 'inconveniëren'
    and 'Fran&ccedil;aise' for 'Française'

    The SGML entity '&amp;' is used to represent the ampersand ('&').
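
The entity notation above can be converted back to ordinary characters. The sketch below uses html.unescape from the Python standard library, which recognises the entity names listed here; the wrapper name decode_sgml is made up for illustration, and using an HTML entity decoder for the lexicon's SGML entities is an assumption that holds for the names listed above.

```python
import html


def decode_sgml(word):
    """Replace SGML character entities such as '&euml;' by the character itself."""
    # ASSUMPTION: all entities in the lexicon are among the standard named
    # entities that html.unescape knows (grave, acute, circ, uml, cedil,
    # tilde, ring, and amp).
    return html.unescape(word)


decoded_words = [
    decode_sgml("inconveni&euml;ren"),
    decode_sgml("Fran&ccedil;aise"),
    decode_sgml("&Aring;lesund"),
]
```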

  5. CGN_LEXICON.Woordsoort ::=
  6. "ADJ(" value ("," value)* ")" |
    "BW(" ("dial"|"") ")" |
    "LID(" value ("," value)* ")" |
    "N(" value ("," value)* ")" |
    "SPEC(afgebr)" |
    "SPEC(deeleigen)" |
    "SPEC(meta)" |
    "SPEC(onverst)" |
    "SPEC(vreemd)" |
    "TSW(" ("dial"|"") ")" |
    "TW(" value ("," value)* ")" |
    "VG(" value ")" |
    "VNW(" value ("," value)* ")" |
    "VZ(" value ("," value)* ")" |
    "WW(" value ("," value)* ")"
    Values for the open word classes according to the document Part of Speech Tagging en Lemmatisering (Van Eynde, 2003):
    ADJ
    adjectief (= adjective)
    BW
    bijwoord (= adverb)
    LID
    lidwoord (= article)
    N
    substantief (= noun)
    SPEC(afgebr)
    code that is used almost exclusively in the lexicon for parts of contracted multi-word expressions (e.g. 'in- en uitvoer'); in the corpus it has also been used for all incomplete words
    SPEC(deeleigen)
    code for part of a multi-word proper name
    SPEC(meta)
    code for a word mention
    SPEC(onverst)
    code for an incomprehensible utterance
    SPEC(vreemd)
    code for an utterance in a foreign language or a foreign word
    TSW
    tussenwerpsel (= interjection)
    TW
    telwoord (= numeral)
    VG
    voegwoord (= conjunction)
    VNW
    voornaamwoord (= pronoun)
    VZ
    voorzetsel (= preposition)
    WW
    werkwoord (= verb)
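
A Woordsoort value consists of a word class followed by a parenthesised, comma-separated list of values. The following minimal Python sketch splits such a tag into its two parts. The function name parse_tag and the regular expression are illustrative assumptions; the sketch does not check the values against the tag set of Van Eynde (2003).

```python
import re

# Word class in capitals, then a parenthesised list without nested brackets.
TAG_RE = re.compile(r"^([A-Z]+)\(([^()]*)\)$")


def parse_tag(tag):
    """Split a tag such as 'WW(inf,vrij,zonder)' into ('WW', ['inf', 'vrij', 'zonder'])."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError("not a well-formed tag: %r" % tag)
    word_class, inner = match.groups()
    # An empty list, as in 'BW()', yields no values.
    values = inner.split(",") if inner else []
    return word_class, values


verb_tag = parse_tag("WW(inf,vrij,zonder)")
adverb_tag = parse_tag("BW()")
```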
  7. CGN_LEXICON.Lemma ::= ([0-9][A-Z][a-z][&'-;_])+

  8. Orthographic representation of the lemma, i.e. the entry that serves as characterisation of the complete inflection paradigm. Diacritics are represented as described above (word form). The default lemma value for word forms in the word class 'SPEC' is an underscore.
  9. CGN_LEXICON.Id-Nummer Lemma ::= [0-9]+

  10. Rank number (Id = 'identification') which indicates which word forms belong to one and the same paradigm. The fact that orthographically identical lemmas can have different Id Numbers means that the lemmas differ in morpho-syntactic characteristics (e.g. different gender with 'het blik' and 'de blik', different part-of-speech with 'het leven' and 'wij leven', and different derivational morphology with 'koker' ('cylinder' vs. 'someone who cooks')), or in pronunciation, as in 'band' ('strip of material': /bAnt/ vs. 'music band': /bEnt/). These different characteristics must lead to a distinction in meaning; where this is not the case (as with 'de matras' and 'het matras') the word form is treated as a single lemma. The meaning distinction is indicated in the field Definitie.
  11. CGN_LEXICON.Syntax

  12. The syntactic complementation patterns per word form. For each word form multiple patterns are possible. These are represented in separate (successive) records. The patterns have been derived from the intersection of CELEX and RBN: patterns that after conversion occurred in only one of the two have not been included. The values used conform to those in the document CGN Syntactische Annotatie (Hoekstra et al. 2004).
  13. CGN_LEXICON.Status ::= ("B" | "INF" | "*d" | "*u" | "*v" | "*x" | "*z")("," Status)* | NULL

  14. Status of the word form:
    B = Belgian
    INF = informal
    *d = dialect
    *u = mispronunciation
    *v = foreign word
    *x = incomprehensible word
    *z = word pronounced with a heavy regional accent, transcribed using normalised spelling
  15. CGN_LEXICON.Uitspraak CGN Nederlands Normaal ::= [+2:@AEGIJNOSYZabdefghijklmnoprstuvwxyz~]*

  16. Canonical pronunciation, i.e. representation of (standard) Dutch pronunciation automatically generated by means of the CGN grapheme-to-phoneme converter 4, trained on the CELEX pronunciation transcription. In this representation no syllables or stress are indicated.
  17. CGN_LEXICON.Uitspraak CGN Vlaams Normaal ::= [*+2:@AEGIJNOSYZabdefghijklmnoprstuvwxyz~]*

  18. Canonical pronunciation, i.e. representation of (standard) Flemish pronunciation generated by means of the CGN grapheme-to-phoneme converter, trained on the FONILEX pronunciation transcription. In this representation no syllables or stress are indicated.
  19. CGN_LEXICON.Uitspraak CGN Vlaams Formeel ::= [+2:@AEGINOSYZ`abdefghijklmnoprstuvwxyz]*

  20. Representation of very formal Flemish pronunciation generated by means of the CGN grapheme-to-phoneme converter, trained on the FONILEX pronunciation transcription. In this representation no syllables or stress are indicated.
  21. CGN_LEXICON.Uitspraak CELEX ::= ['+-2:@AEGIJNOSYZabdefghijklmnoprstuvwxyz~]*

  22. Canonical pronunciation, i.e. representation of the word form incl. syllable boundaries and main stress (in so far as available in CELEX). This representation only shows assimilations which lead to changes on the phoneme level, e.g. Auslautverhärtung (final devoicing; "paard": /'part/) and regressive assimilation and degemination ("inboedel": /'Im-bu-d@l/; "bloeddruk": /'blu-drYk/), and can therefore be characterised as phonemic (i.e. on a level between the phonological and the phonetic level).

    The representation is in the CGN phoneme set, incl. the palatal nasal /J/.

  23. CGN_LEXICON.Morfologie

  24. Hierarchical morphological segmentation of the lemma. This representation concerns the lemma and therefore only comprises derivational and compositional morphology (not a characterisation of the inflectional characteristics of the word form). The representation is redundant to the extent that the morphological representation of the lemma is repeated for each word form. The different levels of segmentation (from the full lemma to the morphemes) are represented by a bracketing in which the part of speech of each morpheme is indicated between square brackets. Bound morphemes (affixes) are indicated by means of periods, or the letter 'x' if the affix is discontinuous (together with a period in the other part).

    Overview of part-of-speech codes:

    The role of the affix in the derivation or composition is indicated by means of a vertical bar. The part of speech following the bar refers to the parts of speech of the morphemes that serve as input for the morphological process, and the part of speech preceding the bar indicates the part of speech of the output of the morphological process (viz. the part of speech of the complex morpheme that has been formed from the other morphemes). Thus '[N|A.]' with 'arrogantie' represents the process of affixing in which the adjective can be transformed to a noun by means of the suffix '-ie':
    ((arrogant)[A],(ie)[N|A.])[N]
    Examples of morphological segmentation:
    boek:
    (boek)[N] (i.e. monomorphemic)
    telraam:
    ((tel)[V],(raam)[N])[N]
    hondenhok:
    ((hond)[N],(en)[N|N.N],(hok)[N])[N]
    onmondig:
    ((on)[A|.A],((mond)[N],(ig)[A|N.])[A])[A]
    gehemelte:
    ((ge)[N|.Nx],(hemel)[N],(te)[N|xN.])[N]
    arbeidsovereenkomst:
    ((arbeid)[N],(s)[N|N.N],(((overeen)[B],(kom)[V])[V],(st)[N|V.])[N])[N]
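
The bracketing shown in these examples can be read back into a tree. The following Python sketch is a minimal recursive parser written under the assumption that every node has the shape "(" content ")" "[" label "]", where the content is either a plain morpheme or a comma-separated list of nodes, as in the examples above. The function names parse_morphology and morphemes are made up for illustration, and error handling for malformed input is deliberately minimal.

```python
def _parse_node(text, pos):
    """Parse one '(content)[label]' node starting at pos; return (node, next_pos)."""
    if text[pos] != "(":
        raise ValueError("expected '(' at position %d" % pos)
    pos += 1
    if text[pos] == "(":
        # Complex node: a comma-separated sequence of child nodes.
        content = []
        while True:
            child, pos = _parse_node(text, pos)
            content.append(child)
            if text[pos] != ",":
                break
            pos += 1
    else:
        # Leaf node: a plain morpheme running up to the closing parenthesis.
        end = text.index(")", pos)
        content = text[pos:end]
        pos = end
    if text[pos] != ")":
        raise ValueError("expected ')' at position %d" % pos)
    pos += 1
    if text[pos] != "[":
        raise ValueError("expected '[' at position %d" % pos)
    end = text.index("]", pos)
    return (text[pos + 1:end], content), end + 1


def parse_morphology(text):
    """Return a nested (label, content) tree for a Morfologie value."""
    node, pos = _parse_node(text, 0)
    if pos != len(text):
        raise ValueError("trailing characters after position %d" % pos)
    return node


def morphemes(node):
    """List the morphemes of a parsed tree from left to right."""
    label, content = node
    if isinstance(content, str):
        return [content]
    result = []
    for child in content:
        result.extend(morphemes(child))
    return result


tree = parse_morphology("((arrogant)[A],(ie)[N|A.])[N]")
parts = morphemes(parse_morphology(
    "((arbeid)[N],(s)[N|N.N],(((overeen)[B],(kom)[V])[V],(st)[N|V.])[N])[N]"))
```

Each node is a (label, content) pair, so the affix label such as 'N|A.' stays available next to the morpheme it belongs to.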
  25. CGN_LEXICON.Corpus Status ::= ( "C" | "I" | "O" | "V" ) | NULL

  26. Code that indicates the orthographical status of the types that are encountered in the corpus. Once the spelling of the word form has been validated, the lexicon entry receives the code V (validated). An incorrect spelling is indicated by means of code I (incorrect). Where (for lack of time within the project) validation of the word form was not possible, no verdict was passed on the correctness of the word form. Instead, the word form received the neutral label O (non-validated). The code C (correct) was used for alternative, correct lemmatisations of instances that were labelled I, O or V. For example,
    396259\asielaanvragen\N(soort,mv,basis)\asielaanvrage\133817\C\
    392625\asielaanvragen\N(soort,mv,basis)\asielaanvraag\131545\V\
  27. CGN_LEXICON.Definitie

  28. For all lemmas that have been included more than once with one and the same part of speech (because they had distinctive formal characteristics such as morpho-syntactic characteristics, gender or derivational morphology) together with a difference in meaning, a compact definition has been included so as to distinguish between the lemmas. For example,
    73704\doorlopen\WW(inf,vrij,zonder)\doorlopen\23802\dor-'lo-p@\V\bewegen door, tot het einde volgen\
    73705\doorlopen\WW(inf,vrij,zonder)\doorlopen\501446\'dor-lo-p@\V\verder lopen, vermengen van kleuren\

1  Centrum voor Lexicale Informatie. Interfacultaire Werkgroep Taal en Spraak, Universiteit van Nijmegen & Max Planck Instituut voor Psycholinguïstiek, Nijmegen.

2  Referentiebestand Nederlands. Vakgroep Lexicologie, Vrije Universiteit Amsterdam & Instituut voor Nederlandse Lexicologie, Leiden & Departement Linguïstiek, Katholieke Universiteit Leuven & Vakgroep Nederlands, Universiteit Utrecht.

3  FONILEX. Centre for Computational Linguistics, Katholieke Universiteit Leuven & Centrum voor Nederlandse Taal en Spraak, Universiteit Antwerpen & Vakgroep voor Electronica en Informatiesystemen, Universiteit Gent

4  CGN grapheme-to-phoneme converter. See:
Véronique Hoste, Steven Gillis & Walter Daelemans (Universiteit Antwerpen), A Rule Induction Approach to Modeling Regional Pronunciation Variation. In: Proceedings of COLING 2000, Saarbrücken, Germany. San Francisco: Morgan Kaufmann Publishers, 2000, pp. 327-333.
and:
Véronique Hoste, Steven Gillis & Walter Daelemans, Machine Learning for Modeling Dutch Pronunciation Variation. Proceedings of the tenth CLIN meeting, Utrecht, The Netherlands.