The CGN Multi-word lexicon

1 March 2004

Richard Piepenbrock
Mila Groot
Raffaela Vlot
Maarten Jansonius

General information

The CGN Multi-word lexicon, as available in version 1.0 of the Spoken Dutch Corpus, is based on an inventory of all multi-word expressions that occur in a number of sources (CELEX [1], RBN [2], Woordenlijst Nederlandse Taal (Groene Boekje, 1995), Corpus Uit den Boogaart [3] and the Van Dale Groot Woordenboek der Nederlandse Taal [4]), complemented with all multi-word expressions that were encountered in the Corpus. The lexicon only includes multi-word expressions that occur in the corpus.

Format and contents of the CGN Multi-word lexicon

The lexicon is available in two formats:

  1. A standard text file (flat ASCII) with the name cgnmlex.txt. The backslash ('\') is used as separator. Letters with diacritics are represented in SGML format. This file can be opened with a simple text editor, or with a database system such as Access, ORACLE or dBase.
  2. An XML file with the name cgnmlex.lex. This file can be opened by any XML browser or editor, and then searched for certain values. The associated DTD (Document Type Definition) mlex.dtd is also available.

The Multi-word lexicon comprises 11 columns. Both lexicon files are ordered by the orthographic multi-word (Orthografie Meerwoord), then by the multi-word part of speech (Woordsoort Meerwoord), the ID number of the multi-word lemma (Id-Nummer Meerwoordslemma) and the rank number of the members of the multi-word expression (Volgnummer van de leden binnen de meerwoordsuitdrukking).

Number of unique multi-word expressions:  23,567
Number of unique multi-word lemmas:       18,593
Number of entries for multi-words:        53,704
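
As a minimal sketch of how one record of the flat text file might be read, the following Python splits a line on the backslash separator and checks for the 11 columns. It assumes that the columns appear in the order in which the fields are described in this document; the class name, the attribute names and the function parse_entry are illustrative and not part of the lexicon distribution.

```python
from typing import NamedTuple


class MultiWordEntry(NamedTuple):
    # Attribute names are illustrative; the column order is an assumption
    # based on the order of the field descriptions in this document.
    orthografie_meerwoord: str
    volgnummer: int
    orthografie_woordvorm: str
    woordsoort_woordvorm: str
    woordsoort_meerwoord: str
    id_nummer: int
    meerwoordslemma: str
    morfologie: str
    definitie: str
    optioneel_lid: str
    continu_meerwoord: str


def parse_entry(line: str) -> MultiWordEntry:
    """Split one backslash-separated record into its 11 columns."""
    fields = line.rstrip("\n").split("\\")
    # A record may end with a trailing separator, which yields one
    # extra empty field after splitting.
    if len(fields) == 12 and fields[-1] == "":
        fields.pop()
    if len(fields) != 11:
        raise ValueError(f"expected 11 columns, got {len(fields)}")
    return MultiWordEntry(
        fields[0], int(fields[1]), fields[2], fields[3], fields[4],
        int(fields[5]), *fields[6:],
    )


# Example record taken from this document (backslashes escaped for Python).
entry = parse_entry(
    "pro forma\\1\\pro\\SPEC(vreemd)\\BW()\\615782\\pro_forma\\\\\\N\\J\\"
)
print(entry.meerwoordslemma, entry.volgnummer, entry.continu_meerwoord)
```

Empty columns (here the morphology and the definition) are kept as empty strings, so the column positions stay fixed.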

Contents of the lexicon fields

  1. CGN_MLEXICON.Orthografie Meerwoord ::= ([0-9][A-Z][a-z][ &'*-;])+

  2. Orthographic representation of the multi-word expression. The inflection paradigm for the multi-word lemma has been included here to the extent that inflected forms occur in the corpus. Diacritics are represented in SGML format as follows:

    "&" + capital/small letter + diacritic + ";"

    Specifically:
    Letters:    "a", "c", "e", "i", "n", "o", "u", "A", "C", "E", "I", "N", "O", "U"
    Diacritics: "grave", "acute" (= aigu), "circ" (= circonflexe), "uml" (= trema), "cedil" (= cedille), "tilde", "ring"

    For example, '&agrave; la carte' for 'à la carte' and 'Gustaf &Aring;kermans' for 'Gustaf Åkermans'.

    The SGML symbol '&amp;' is used to represent the ampersand ('&').
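
The SGML representation can be converted back to plain characters. The sketch below assumes that the entity names used in the lexicon coincide with the standard names known to Python's html module (which covers a much larger set than the combinations listed above); the function name decode_diacritics is illustrative.

```python
import html


def decode_diacritics(text: str) -> str:
    """Replace SGML character entities such as '&agrave;' by the character itself."""
    # html.unescape knows the standard named entities, including the
    # letter + diacritic names described above and '&amp;'.
    return html.unescape(text)


print(decode_diacritics("&agrave; la carte"))
print(decode_diacritics("Gustaf &Aring;kermans"))
```

Note that html.unescape also decodes entities that the lexicon itself never uses, so it is more permissive than the field grammar.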

  3. CGN_MLEXICON.Volgnummer ::= [1-9]+

  4. This number indicates the position of the word form in the sentence relative to the other parts of the multi-word expression.
  5. CGN_MLEXICON.Orthografie Woordvorm ::= ([0-9][A-Z][a-z][&'-;])+

  6. Orthographic representation of the word form, i.e. the individual parts of the multi-word expression. Diacritics are represented as described above.
  7. CGN_MLEXICON.Woordsoort Woordvorm ::=

  8. The part of speech of the word form, i.e. the individual parts of the multi-word expression.
    "ADJ(" value ("," value)* ")" |
    "BW()" |
    "LID(" value ("," value)* ")" |
    "N(" value ("," value)* ")" |
    "SPEC(deeleigen)" |
    "SPEC(meta)" |
    "SPEC(onverst)" |
    "SPEC(vreemd)" |
    "TSW()" |
    "TW(" value ("," value)* ")" |
    "VG(" value ")" |
    "VNW(" value ("," value)* ")" |
    "VZ(" value ")" |
    "WW(" value ("," value)* ")"
    Values for the open word classes conform to the document Part of Speech Tagging en Lemmatisering (Van Eynde 2003):

    ADJ              adjectief (= adjective)
    BW               bijwoord (= adverb)
    LID              lidwoord (= article)
    N                substantief (= noun)
    SPEC(deeleigen)  code for part of a compound proper name
    SPEC(meta)       code for a mention
    SPEC(onverst)    code for an incomprehensible utterance
    SPEC(vreemd)     code for an utterance in a foreign language
    TSW              tussenwerpsel (= interjection)
    TW               telwoord (= numeral)
    VG               voegwoord (= conjunction)
    VNW              voornaamwoord (= pronoun)
    VZ               voorzetsel (= preposition)
    WW               werkwoord (= verb)
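
A part-of-speech tag of this shape can be split into its word class and its list of values. The following is a minimal sketch, assuming that values never contain parentheses or commas; the names TAG_RE and parse_tag are illustrative.

```python
import re

# Word class in capitals, followed by a possibly empty, comma-separated
# list of values between parentheses, e.g. "WW(pv,tgw,met-t)" or "BW()".
TAG_RE = re.compile(r"^([A-Z]+)\(([^()]*)\)$")


def parse_tag(tag: str) -> tuple[str, list[str]]:
    """Return the word class and the list of values of a tag."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a valid tag: {tag!r}")
    word_class, values = match.groups()
    return word_class, (values.split(",") if values else [])


print(parse_tag("WW(pv,tgw,met-t)"))
print(parse_tag("SPEC(vreemd)"))
print(parse_tag("BW()"))
```

The sketch only checks the outer shape of the tag. It does not check that the values are valid for the given word class.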
  9. CGN_MLEXICON.Woordsoort Meerwoord ::=

  10. The part of speech of a multi-word expression in cases where, from a grammatical point of view, the full expression can be regarded as one word. Values are the same as those for the word form. In addition, the following code occurs:

    COMB(eigen)      code for compound proper name or title

    Warning: this field has only been included in the text version of the lexicon, viz. cgnmlex.txt (and not in the XML version cgnmlex.lex). It is a provisional code that may be subject to change in the future.

  11. CGN_LEXICON.Id-Nummer Meerwoordslemma ::= [0-9]+

  12. Rank number (Id = 'identification') which indicates which multi-word expressions belong to one and the same paradigm. The distinction is only relevant for possibly discontinuous verbs. Where orthographically identical (multi-word) lemmas occur with different ID numbers, the lemmas differ in morpho-syntactic (e.g. strong or weak declension) or phonetic (e.g. stress) characteristics, in combination with a difference in meaning. The difference in meaning is indicated in the field Definitie Meerwoordslemma.
  13. CGN_MLEXICON.Meerwoordslemma ::= ([0-9][A-Z][a-z][&'*-;_])*

  14. The lemma of the multi-word expression, such as 'uitademen' for multi-word instances like '(ik) adem uit'. For continuous multi-word expressions, viz. fully integrated foreign expressions, compound proper names and titles, a dummy lemma form is postulated which is identical to the expression (the parts are linked by means of underscores). For example,
    pro forma\1\pro\SPEC(vreemd)\BW()\615782\pro_forma\\\N\J\
    pro forma\2\forma\SPEC(vreemd)\BW()\615782\pro_forma\\\N\J\
    Kim Clijsters\1\Kim\SPEC(deeleigen)\COMB(eigen)\608084\Kim_Clijsters\\\J\J\
    Kim Clijsters\2\Clijsters\SPEC(deeleigen)\COMB(eigen)\608084\Kim_Clijsters\\\J\J
  15. CGN_LEXICON.Morfologie Meerwoordslemma

  16. Hierarchical morphological segmentation of the multi-word lemma. This representation concerns the multi-word lemma and only comprises derivational and compositional morphology (no characterisation is given of the inflectional characteristics of the word form). The morphological segmentation is only relevant for possibly discontinuous verbs. The representation is redundant in the sense that the morphological representation is repeated for each word form. The different levels of segmentation (from the full multi-word lemma to its morphemes) are represented in the form of a bracketing. The part of speech for each morpheme is indicated between square brackets. Bound morphemes (affixes) are indicated by means of periods, or by the letter 'x' in case the affix is discontinuous (together with a period before the other part).

    Overview of word class codes:

    The role of the affix in the derivation or composition is indicated by means of a vertical bar. The part of speech following the bar refers to the parts of speech of the morphemes that serve as input for the morphological process, and the part of speech preceding the bar indicates the part of speech of the output of the process (viz. the part of speech of the complex morpheme formed from the other morphemes). Thus '[V|.A]' in 'voorverwarmen' represents the affixation process in which the prefix 'ver-' turns an adjective into a verb:
    voorverwarmen ((voor)[B],((ver)[V|.A],(warm)[A])[V])[V]
    Examples of morphological segmentation:
    dichtmaken:        ((dicht)[A],(maak)[V])[V]
    navertellen:       ((na)[P],((ver)[V|.V],(tel)[V])[V])[V]
    achteruitdeinzen:  (((achter)[B],(uit)[B])[B],(deins)[V])[V]
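
The bracketing can be read into a tree with a small recursive parser. The sketch below assumes the grammar that the examples suggest: a node is a parenthesised body followed by a label in square brackets, and the body is either a single morpheme or a comma-separated list of nodes. The names Node, parse_segmentation and morphemes are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                          # e.g. "V" or "V|.A"
    morpheme: str = ""                  # filled for leaves only
    children: list["Node"] = field(default_factory=list)


def _expect(text: str, pos: int, char: str) -> int:
    """Check that text[pos] is the given character and step past it."""
    if not text.startswith(char, pos):
        raise ValueError(f"expected {char!r} at position {pos}")
    return pos + 1


def _parse_node(text: str, pos: int) -> tuple[Node, int]:
    pos = _expect(text, pos, "(")
    children: list[Node] = []
    morpheme = ""
    if text.startswith("(", pos):
        # Complex body: one or more nodes separated by commas.
        while True:
            child, pos = _parse_node(text, pos)
            children.append(child)
            if not text.startswith(",", pos):
                break
            pos += 1
    else:
        # Simple body: the morpheme runs up to the closing parenthesis.
        end = text.index(")", pos)
        morpheme, pos = text[pos:end], end
    pos = _expect(text, pos, ")")
    pos = _expect(text, pos, "[")
    end = text.index("]", pos)
    return Node(text[pos:end], morpheme, children), end + 1


def parse_segmentation(text: str) -> Node:
    node, pos = _parse_node(text, 0)
    if pos != len(text):
        raise ValueError(f"unexpected trailing text at position {pos}")
    return node


def morphemes(node: Node) -> list[str]:
    """Collect the morphemes at the leaves, from left to right."""
    if not node.children:
        return [node.morpheme]
    return [m for child in node.children for m in morphemes(child)]


tree = parse_segmentation("((voor)[B],((ver)[V|.A],(warm)[A])[V])[V]")
print(tree.label, morphemes(tree))
```

The label is kept as an uninterpreted string, so a label such as 'V|.A' still has to be split on the vertical bar if the input and output parts of speech are needed separately.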
  17. CGN_LEXICON.Definitie Meerwoordslemma

  18. For all multi-word lemmas that have been included more than once with one and the same part of speech (because they had distinctive formal characteristics such as morpho-syntactic characteristics, gender or derivational morphology) together with a difference in meaning, a compact definition has been included so as to distinguish between the lemmas. This field is only relevant for possibly discontinuous verbs. Cases of such ambiguity will not occur within the lexicon, but do occur in a comparison with the single word lexicon cgnlex.txt. For example,
    loopt door\WW(pv,tgw,met-t)\501446\doorlopen\((door)[B],(loop)[V])[V]\verder lopen, vermengen van kleuren\J\N\
  19. CGN_MLEXICON.Optioneel lid ::= "J" | "N"

  20. If the word form (Woordvorm) is an optional part of a multi-word expression, the value of this field is 'J'. If the word form is an obligatory part of a multi-word expression, the value is 'N'. Thus 'ademt' as part of 'inademen' and 'uitademen' has the value 'J', while 'apen' as part of 'na-apen' receives the value 'N'.
  21. CGN_MLEXICON.Continu meerwoord ::= "J" | "N"

  22. If the multi-word expression cannot be interrupted (by constituents other than hesitations or interjections), as for example 'Tien Voor Taal' or 'per se', the multi-word expression as a whole is given the value 'J'; otherwise it is given the value 'N', as in the case of possibly discontinuous verbs.
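
The fields Volgnummer, Id-Nummer Meerwoordslemma and Continu meerwoord can be combined to collect the members of each continuous expression in their proper order. The following is a minimal sketch under the assumption that the 11 columns appear in the order in which they are described above; the function name continuous_expressions is illustrative, and the sample records are the examples given in this document, deliberately shuffled.

```python
from collections import defaultdict

# Example records from this document, out of order (backslashes escaped).
SAMPLE_LINES = [
    "Kim Clijsters\\2\\Clijsters\\SPEC(deeleigen)\\COMB(eigen)\\608084\\Kim_Clijsters\\\\\\J\\J\\",
    "pro forma\\2\\forma\\SPEC(vreemd)\\BW()\\615782\\pro_forma\\\\\\N\\J\\",
    "Kim Clijsters\\1\\Kim\\SPEC(deeleigen)\\COMB(eigen)\\608084\\Kim_Clijsters\\\\\\J\\J\\",
    "pro forma\\1\\pro\\SPEC(vreemd)\\BW()\\615782\\pro_forma\\\\\\N\\J\\",
]


def continuous_expressions(lines):
    """Map (lemma ID, expression) to its word forms in Volgnummer order,
    keeping only expressions whose Continu meerwoord field is 'J'."""
    members = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\\")
        expression, rank, word_form = fields[0], int(fields[1]), fields[2]
        lemma_id, continuous = int(fields[5]), fields[10]
        if continuous == "J":
            members[(lemma_id, expression)].append((rank, word_form))
    # Sorting the (rank, word form) pairs restores the order of the members.
    return {
        key: [word for _, word in sorted(parts)]
        for key, parts in members.items()
    }


for key, words in continuous_expressions(SAMPLE_LINES).items():
    print(key, words)
```

Grouping on both the ID number and the orthographic expression keeps different inflected forms of the same lemma apart.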

1  Centrum voor Lexicale Informatie. Interfacultaire Werkgroep Taal en Spraak, Universiteit van Nijmegen & Max Planck Instituut voor Psycholinguïstiek, Nijmegen.

2  Referentiebestand Nederlands. Vakgroep Lexicologie, Vrije Universiteit Amsterdam & Instituut voor Nederlandse Lexicologie, Leiden & Departement Linguïstiek, Katholieke Universiteit Leuven & Vakgroep Nederlands, Universiteit Utrecht.

3  Boogaart, P.C. Uit den (1975). Woordfrequenties: in Geschreven en Gesproken Nederlands. Utrecht: Oosthoek, Scheltema & Holkema. Electronic version available as part of the Eindhoven Corpus.

4  Geerts, G. & T. den Boon (1999). Van Dale Groot Woordenboek der Nederlandse Taal. Utrecht/Antwerpen: Van Dale Lexicografie.