Lemmatisation

The entire corpus was lemmatised automatically by means of a lemmatiser. The output was checked and, where necessary, corrected manually.

Below, the lemmatisation of the data in the Spoken Dutch Corpus is described in some detail. Attention is given to the aims pursued, the procedure that was adopted, and the protocol that was developed. We also provide information on the file types and formats. Finally, an overview is given of the data that are available in version 1.0.
 





Aim and motivation

Lemma information was added to the individual tokens in the corpus to help users search for related word forms. The orthographic token is the unit of annotation: no attempt has been made to relate the parts of possibly discontinuous verbs or prepositions to one another. Similarly, the parts of multi-word proper names have been lemmatised token by token.
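
To illustrate how token-level lemma annotation supports searching for related word forms, here is a minimal Python sketch that indexes tokens by lemma. The token list, the (word form, lemma) pairs, and the function names are hypothetical; they are chosen for illustration and do not reflect the corpus file formats.

```python
from collections import defaultdict

# Hypothetical (word form, lemma) pairs, one per orthographic token.
# Illustrative only; not taken from the corpus files.
TOKENS = [
    ("loopt", "lopen"),
    ("liep", "lopen"),
    ("gelopen", "lopen"),
    ("huizen", "huis"),
    ("huis", "huis"),
]


def build_lemma_index(tokens):
    """Map each lemma to the sorted list of distinct word forms annotated with it."""
    index = defaultdict(set)
    for word_form, lemma in tokens:
        index[lemma].add(word_form)
    return {lemma: sorted(forms) for lemma, forms in index.items()}


def related_forms(index, lemma):
    """Return all word forms that share the given lemma (empty list if unknown)."""
    return index.get(lemma, [])


index = build_lemma_index(TOKENS)
print(related_forms(index, "lopen"))
```

Because each orthographic token carries its own lemma, a lookup of this kind finds inflected forms of one word, but it does not join the separated parts of a discontinuous verb or a multi-word name.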
 


Procedure

To facilitate the lemmatisation process, a lemmatiser developed at the University of Tilburg was used. Its output was checked and, where necessary, corrected manually.
 

Protocol

There is no separate protocol for lemmatisation. Instead, a combined protocol describes both POS tagging and lemmatisation:

Van Eynde, F. 2003. Protocol voor POS tagging en lemmatisering. (Here available in .pdf format; Dutch only.)
 
 


File types and formats

The lemmatisation is stored together with the POS tagging in the following files:

For descriptions of the aforementioned formats, see the descriptions of the plk format and the tag format.
 


Overview of available data

In Table 1 an overview is given of the data that are available in version 1.0 of the corpus. For a description of the design of the corpus and its motivation, we refer you to the description of the corpus design.
 

Table 1. Overview of available data (VL = data originating from Flanders, NL = data originating from the Netherlands)
 
 
Component                                                                  Total number         VL         NL
                                                                               of words
a. Spontaneous conversations ('face-to-face')                                 2,626,172    878,383  1,747,789
b. Interviews with teachers of Dutch                                            565,433    315,554    249,879
c. Spontaneous telephone dialogues (recorded via a switchboard)               1,208,633    465,096    743,537
d. Spontaneous telephone dialogues (recorded on MD via a local interface)       853,371    343,167    510,204
e. Simulated business negotiations                                              136,461          0    136,461
f. Interviews/discussions/debates (broadcast)                                   790,269    250,708    539,561
g. (Political) discussions/debates/meetings (non-broadcast)                     360,328    138,819    221,509
h. Lessons recorded in the classroom                                            405,409    105,436    299,973
i. Live (e.g. sports) commentaries (broadcast)                                  208,399     78,022    130,377
j. News reports/reportages (broadcast)                                          186,072     95,206     90,866
k. News (broadcast)                                                             368,153     82,855    285,298
l. Commentaries/columns/reviews (broadcast)                                     145,553     65,386     80,167
m. Ceremonious speeches/sermons                                                  18,075     12,510      5,565
n. Lectures/seminars                                                            140,901     79,067     61,834
o. Read speech                                                                  903,043    351,419    551,624
Total                                                                         8,916,272  3,261,628  5,654,644
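
The figures in Table 1 can be cross-checked with a short Python sketch: for each component the VL and NL counts should add up to the total number of words, and the columns should add up to the Total row. The variable and function names below are illustrative; the numbers are transcribed from Table 1.

```python
# (component, total, VL, NL) word counts transcribed from Table 1.
ROWS = [
    ("a", 2_626_172, 878_383, 1_747_789),
    ("b", 565_433, 315_554, 249_879),
    ("c", 1_208_633, 465_096, 743_537),
    ("d", 853_371, 343_167, 510_204),
    ("e", 136_461, 0, 136_461),
    ("f", 790_269, 250_708, 539_561),
    ("g", 360_328, 138_819, 221_509),
    ("h", 405_409, 105_436, 299_973),
    ("i", 208_399, 78_022, 130_377),
    ("j", 186_072, 95_206, 90_866),
    ("k", 368_153, 82_855, 285_298),
    ("l", 145_553, 65_386, 80_167),
    ("m", 18_075, 12_510, 5_565),
    ("n", 140_901, 79_067, 61_834),
    ("o", 903_043, 351_419, 551_624),
]

# The "Total" row as stated in Table 1: (total, VL, NL).
STATED_TOTAL = (8_916_272, 3_261_628, 5_654_644)


def inconsistent_components(rows):
    """Return the components whose VL + NL does not equal the stated total."""
    return [comp for comp, total, vl, nl in rows if vl + nl != total]


def column_sums(rows):
    """Sum the total, VL and NL columns over all components."""
    return tuple(sum(row[i] for row in rows) for i in (1, 2, 3))


print(inconsistent_components(ROWS))
print(column_sums(ROWS) == STATED_TOTAL)
```

Both checks pass for the figures listed in Table 1.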




Frequency information

On the basis of the lemmatised data in the corpus, an alphabetical frequency list was compiled. It lists the lemmas and their frequencies, together with the word forms and tags associated with them. The frequency list (lemalph.frq) can be found in the directory /data/lexicon/freqlists of the annotation DVD. A description of the structure of the list can be found at ../../lexicon/freq_lst.htm
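
As a rough illustration of how such a list can be derived from lemmatised data, the Python sketch below counts lemmas and records the word forms and tags that occur with each of them. It is a sketch under stated assumptions, not the procedure used for the corpus: the input triples, the placeholder tags, and the output structure are hypothetical and do not reproduce the layout of lemalph.frq.

```python
from collections import Counter, defaultdict

# Hypothetical (word form, lemma, tag) triples; the tags are placeholders,
# not the tag set used in the corpus.
TOKENS = [
    ("loopt", "lopen", "V"),
    ("liep", "lopen", "V"),
    ("loopt", "lopen", "V"),
    ("huizen", "huis", "N"),
]


def lemma_frequency_list(tokens):
    """Build an alphabetical list of (lemma, frequency, form/tag counts)."""
    lemma_freq = Counter()
    forms_by_lemma = defaultdict(Counter)
    for word_form, lemma, tag in tokens:
        lemma_freq[lemma] += 1
        forms_by_lemma[lemma][(word_form, tag)] += 1
    return [
        (lemma, lemma_freq[lemma], sorted(forms_by_lemma[lemma].items()))
        for lemma in sorted(lemma_freq)  # alphabetical order by lemma
    ]


for lemma, freq, forms in lemma_frequency_list(TOKENS):
    print(lemma, freq, forms)
```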