Version 1.0

This release makes available the results of the Spoken Dutch Corpus project. They comprise the recordings with all accompanying transcriptions and annotations, the documentation, the lexicon and the COREX exploitation software.

The data available in this release are summarised below.
 

Overview of available data with basic transcriptions and annotations

Table 1 gives an overview of the data for which, in addition to the sound recording, an orthographic transcription, POS tagging, lemmatisation, an automatic broad phonetic transcription and an automatic word segmentation are available.

Table 1. Overview of data in Version 1
('VL' refers to the Flemish data, 'NL' to the Dutch data; all figures are numbers of words)

Component                                                                       Total         VL         NL
a. Spontaneous conversations ('face-to-face')                               2,626,172    878,383  1,747,789
b. Interviews with teachers of Dutch                                          565,433    315,554    249,879
c. Spontaneous telephone dialogues (recorded via a switchboard)             1,208,633    465,096    743,537
d. Spontaneous telephone dialogues (recorded on MD with local interface)      853,371    343,167    510,204
e. Simulated business negotiations                                            136,461          0    136,461
f. Interviews/discussions/debates (broadcast)                                 790,269    250,708    539,561
g. (political) Discussions/debates/meetings (non-broadcast)                   360,328    138,819    221,509
h. Lessons recorded in the classroom                                          405,409    105,436    299,973
i. Live (e.g. sports) commentaries (broadcast)                                208,399     78,022    130,377
j. News reports/reportages (broadcast)                                        186,072     95,206     90,866
k. News (broadcast)                                                           368,153     82,855    285,298
l. Commentaries/columns/reviews (broadcast)                                   145,553     65,386     80,167
m. Ceremonious speeches/sermons                                                18,075     12,510      5,565
n. Lectures/seminars                                                          140,901     79,067     61,834
o. Read speech                                                                903,043    351,419    551,624
Total                                                                       8,916,272  3,261,628  5,654,644
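
For readers who want to work with the figures in Table 1 programmatically, the following is a minimal sketch that tallies the per-component word counts by region and derives the Flemish share of the corpus. The dictionary layout and the helper name `region_totals` are illustrative assumptions and are not part of the corpus distribution or of COREX.

```python
# Word counts per component from Table 1, stored as (VL, NL) pairs.
# The dictionary layout and the helper name are illustrative assumptions.
TABLE_1 = {
    "a": (878_383, 1_747_789),
    "b": (315_554, 249_879),
    "c": (465_096, 743_537),
    "d": (343_167, 510_204),
    "e": (0, 136_461),
    "f": (250_708, 539_561),
    "g": (138_819, 221_509),
    "h": (105_436, 299_973),
    "i": (78_022, 130_377),
    "j": (95_206, 90_866),
    "k": (82_855, 285_298),
    "l": (65_386, 80_167),
    "m": (12_510, 5_565),
    "n": (79_067, 61_834),
    "o": (351_419, 551_624),
}


def region_totals(table):
    """Return (VL total, NL total, grand total) over all components."""
    vl_total = sum(vl for vl, _ in table.values())
    nl_total = sum(nl for _, nl in table.values())
    return vl_total, nl_total, vl_total + nl_total


vl_total, nl_total, grand_total = region_totals(TABLE_1)
print(f"VL: {vl_total:,}  NL: {nl_total:,}  total: {grand_total:,}")
print(f"Flemish share of all words: {vl_total / grand_total:.1%}")
```

The same pattern could be applied to the counts of additionally annotated data, with one tuple entry per annotation type instead of per region.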

For more information about


Overview of data with additional transcriptions and annotations

Tables 2a and 2b give an overview of the additional transcriptions and annotations that are available for a selection of samples. Wherever a manually verified broad phonetic transcription is available, the word segmentation has also been manually verified. Table 2a covers the data from the Netherlands, Table 2b the Flemish data. For more information, see the meta-data (sample information).



Table 2a. Additional transcriptions and annotations (data from The Netherlands)
(Quantity of data, in number of words, with a phonetic transcription, a syntactic annotation or a prosodic annotation)

Component                                                                      Phonetic  Syntactic   Prosodic
a. Spontaneous conversations ('face-to-face')                                   106,182    300,368     37,406
b. Interviews with teachers of Dutch                                             25,687     25,687      7,596
c. Spontaneous telephone dialogues (recorded via a switchboard)                 201,141     69,933     20,070
d. Spontaneous telephone dialogues (recorded on MD with a local interface)            0          0          0
e. Simulated business negotiations                                               25,485     25,485      7,485
f. Interviews/discussions/debates (broadcast)                                    75,106     75,106      7,537
g. (political) Discussions/debates/meetings (non-broadcast)                      25,117     25,117      7,678
h. Lessons recorded in the classroom                                             25,961     25,961          0
i. Live (e.g. sports) commentaries (broadcast)                                   24,986     24,986      5,866
j. News reports/reportages (broadcast)                                           25,065     25,065      5,617
k. News (broadcast)                                                              25,296     25,384      7,437
l. Commentaries/columns/reviews (broadcast)                                      25,071     25,071      7,541
m. Ceremonious speeches/sermons                                                   5,184      5,184        978
n. Lectures/seminars                                                             14,913     14,913      6,577
o. Read speech                                                                   70,223          0          0
Total                                                                           675,417    668,260    121,788

 

Table 2b. Additional transcriptions and annotations (data from Flanders)
(Quantity of data, in number of words, with a phonetic transcription, a syntactic annotation or a prosodic annotation)

Component                                                                      Phonetic  Syntactic   Prosodic
a. Spontaneous conversations ('face-to-face')                                    70,945    146,745     49,988
b. Interviews with teachers of Dutch                                             34,064     34,064      7,667
c. Spontaneous telephone dialogues (recorded via a switchboard)                  68,886     19,886     19,874
d. Spontaneous telephone dialogues (recorded on MD with a local interface)        6,257      6,257          0
e. Simulated business negotiations                                                    0          0          0
f. Interviews/discussions/debates (broadcast)                                    25,144     25,144     10,007
g. (political) Discussions/debates/meetings (non-broadcast)                       9,009      9,009      5,414
h. Lessons recorded in the classroom                                             10,103     10,103          0
i. Live (e.g. sports) commentaries (broadcast)                                   10,130     10,130      6,002
j. News reports/reportages (broadcast)                                            7,679      7,679      6,054
k. News (broadcast)                                                               7,305      7,305      6,248
l. Commentaries/columns/reviews (broadcast)                                       7,431      7,431      5,998
m. Ceremonious speeches/sermons                                                   1,893      1,893      1,124
n. Lectures/seminars                                                              8,143      8,143      3,880
o. Read speech                                                                   64,848     44,144          0
Total                                                                           331,837    337,933    122,256