Lemmatisation

The entire corpus was lemmatised with the help of a lemmatiser. The output was checked and, where necessary, corrected manually.

Below, the lemmatisation of the data in the Spoken Dutch Corpus is described in some detail. Attention is given to the aims pursued, the protocol that was developed, and the procedure that was adopted. We also provide information on the file types and formats. Finally, an overview is given of the data that are available in version 1.0.

Aim and motivation

Lemma information was added to the individual tokens in the corpus to help users search for related word forms. The orthographic token is the unit of annotation. No attempt has been made to relate the parts of possibly discontinuous verbs or prepositions to one another. Similarly, the parts of multi-word proper names have been lemmatised token by token.
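
To illustrate how token-level lemmas support a search for related word forms, the following minimal Python sketch builds an index from lemmas to word forms. The (word form, lemma) pairs and the function name forms_by_lemma are hypothetical and are not taken from the corpus files. The pair ("op", "op") stands for the particle of a separable verb and is treated as a token of its own, in line with the token-by-token approach.

```python
from collections import defaultdict

# Hypothetical (word form, lemma) pairs, one per orthographic token.
# Illustrative only; not taken from the corpus.
TOKENS = [
    ("belt", "bellen"),
    ("op", "op"),        # verb particle, lemmatised as a separate token
    ("belden", "bellen"),
    ("gebeld", "bellen"),
]

def forms_by_lemma(tokens):
    """Map each lemma to the set of word forms annotated with it."""
    index = defaultdict(set)
    for form, lemma in tokens:
        index[lemma].add(form)
    return index

index = forms_by_lemma(TOKENS)
print(sorted(index["bellen"]))  # ['belden', 'belt', 'gebeld']
```

Because the unit of annotation is the token, a lookup of this kind finds the verb forms under "bellen" but has no entry that joins the verb with its particle.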


Procedure

To facilitate the lemmatisation process, a lemmatiser developed at the University of Tilburg was used. The output of the lemmatiser was checked and, where necessary, corrected manually.


Protocol

There is no separate protocol for lemmatisation. Instead, there is a combined protocol that describes both POS tagging and lemmatisation:

Van Eynde, F. 2003. Protocol voor POS tagging en lemmatisering. (Available in .pdf format; Dutch only.)


File types and formats

The lemmatisation is stored together with the POS tagging in the following files:

Files of type .plk, an ASCII format. These files can be found in the directory /data/annot/text/plk/ of the annotation DVD published with version 1.0.
Files of type .tag, an XML format. These files can be found in the directory /data/annot/xml/tag/ of the annotation DVD.

For details of both formats, see the descriptions of the plk format and the tag format.


Overview of available data

Table 1 gives an overview of the data that are available in version 1.0 of the corpus. For the design of the corpus and its motivation, see the description of the corpus design.

Table 1. Overview of available data (VL = data originating from Flanders, NL = data originating from the Netherlands)

Component	Total number of words	VL	NL
a.	Spontaneous conversations ('face-to-face')	2,626,172	878,383	1,747,789
b.	Interviews with teachers of Dutch	565,433	315,554	249,879
c.	Spontaneous telephone dialogues (recorded via a switchboard)	1,208,633	465,096	743,537
d.	Spontaneous telephone dialogues (recorded on MD via a local interface)	853,371	343,167	510,204
e.	Simulated business negotiations	136,461	0	136,461
f.	Interviews/discussions/debates (broadcast)	790,269	250,708	539,561
g.	(political) Discussions/debates/meetings (non-broadcast)	360,328	138,819	221,509
h.	Lessons recorded in the classroom	405,409	105,436	299,973
i.	Live (eg sports) commentaries (broadcast)	208,399	78,022	130,377
j.	Newsreports/reportages (broadcast)	186,072	95,206	90,866
k.	News (broadcast)	368,153	82,855	285,298
l.	Commentaries/columns/reviews (broadcast)	145,553	65,386	80,167
m.	Ceremonious speeches/sermons	18,075	12,510	5,565
n.	Lectures/seminars	140,901	79,067	61,834
o.	Read speech	903,043	351,419	551,624
Total	8,916,272	3,261,628	5,654,644
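
The figures in Table 1 can be cross-checked: for each component the VL and NL counts should add up to the component total, and each column should add up to the Total row. A minimal Python sketch of such a check follows; the figures are copied from Table 1, while the names ROWS, TOTALS and check_table are illustrative.

```python
# (component, total, VL, NL), copied from Table 1.
ROWS = [
    ("a", 2626172, 878383, 1747789),
    ("b", 565433, 315554, 249879),
    ("c", 1208633, 465096, 743537),
    ("d", 853371, 343167, 510204),
    ("e", 136461, 0, 136461),
    ("f", 790269, 250708, 539561),
    ("g", 360328, 138819, 221509),
    ("h", 405409, 105436, 299973),
    ("i", 208399, 78022, 130377),
    ("j", 186072, 95206, 90866),
    ("k", 368153, 82855, 285298),
    ("l", 145553, 65386, 80167),
    ("m", 18075, 12510, 5565),
    ("n", 140901, 79067, 61834),
    ("o", 903043, 351419, 551624),
]
TOTALS = (8916272, 3261628, 5654644)  # Total row: total, VL, NL

def check_table(rows, totals):
    """Return (components where VL + NL != total, whether column sums match)."""
    mismatches = [name for name, total, vl, nl in rows if vl + nl != total]
    column_sums = tuple(sum(col) for col in zip(*(row[1:] for row in rows)))
    return mismatches, column_sums == totals

mismatches, totals_ok = check_table(ROWS, TOTALS)
print(mismatches, totals_ok)  # [] True
```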


Frequency information

On the basis of the lemmatised data in the corpus, an alphabetical frequency list was compiled that gives each lemma and its frequency together with the word forms and tags associated with it. The frequency list (lemalph.frq) can be found in the directory /data/lexicon/freqlists of the annotation DVD. A description of the structure of the list can be found at ../../lexicon/freq_lst.htm
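
The kind of aggregation behind such a list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input triples are hypothetical, the tag strings are placeholders rather than real corpus tags, and the output does not reproduce the actual layout of lemalph.frq, which is documented in the description referred to above.

```python
from collections import Counter, defaultdict

# Hypothetical (word form, tag, lemma) triples; the tags are placeholders,
# not tags from the corpus tag set.
TOKENS = [
    ("belt", "TAG_A", "bellen"),
    ("belden", "TAG_B", "bellen"),
    ("belt", "TAG_A", "bellen"),
    ("op", "TAG_C", "op"),
]

def lemma_frequencies(tokens):
    """Return (lemma, frequency, [((form, tag), frequency), ...]) tuples,
    sorted alphabetically by lemma."""
    totals = Counter()
    details = defaultdict(Counter)
    for form, tag, lemma in tokens:
        totals[lemma] += 1
        details[lemma][(form, tag)] += 1
    return [
        (lemma, totals[lemma], sorted(details[lemma].items()))
        for lemma in sorted(totals)
    ]

result = lemma_frequencies(TOKENS)
for lemma, freq, forms in result:
    print(lemma, freq, forms)
```

Each lemma is counted once per token, and the per-lemma breakdown keeps word form and tag together so that the same form under different tags would be listed separately.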
