Syntax

Part of the corpus was annotated syntactically using the Annotate program, which was developed at the University of Saarbrücken.

Below we discuss the syntactic annotation of the Spoken Dutch Corpus and the aim and motivation for this type of annotation. We also describe the protocol that was developed and the procedure that was adopted. Information is also included about the file types and formats that are available. Finally, an overview is presented of the data that are available in version 1.0 of the corpus.
 

Aim and motivation

The syntactic annotation of the data was based on the following ideas (cf. Hoekstra et al. 2003: 4):

Input: On the input side we wanted the annotation schemes to be as simple as possible, so as to keep the workload of annotating and correcting the data to a minimum.
Output: On the output side we wanted to offer as rich an annotation as possible, in a format that could take on different forms for various user groups.

To achieve this, we decided to aim for a dependency analysis that is to a large extent theory-neutral. The primary annotation can be enriched with POS information and, through the lexicon link-up, with information from the CGN lexicon. Combining the information from these three resources makes it possible to produce output that meets the needs of various user groups.



Procedure

To facilitate and speed up the annotation process, the Annotate software developed at the University of Saarbrücken was used. The Flemish data were annotated in Leuven (CCL); the data from the Netherlands were annotated by OTS, Utrecht. The data were annotated in multiple passes: after a first annotation had been produced, it went through a number of correction cycles. Checks were also made to ensure consistency.
 
 


Protocol

For the syntactic annotation of the corpus a protocol was developed:
 

Hoekstra, H., M. Moortgat, B. Renmans, M. Schouppe, I. Schuurman & T. van der Wouden. 2003. CGN Syntactische annotatie. (Available in .pdf format.)
 
 


File types and formats

The syntactic annotations have been stored in the following files:

For the formats mentioned above, separate descriptions are available:


Overview of available data

In Table 1 an overview is given of the data that are available in version 1.0 of the corpus. For a description of the design of the corpus and its motivation, we refer you to the description of the corpus design.
 

Table 1. Overview of the data for which a syntactic annotation is available
(VL = data originating from Flanders; NL = data originating from The Netherlands)
 
Component                                                                  Total number of words       VL       NL
a. Spontaneous conversations ('face-to-face')                                            447,113  146,745  300,368
b. Interviews with teachers of Dutch                                                      59,751   34,064   25,687
c. Spontaneous telephone dialogues (recorded via a switchboard)                           89,819   19,886   69,933
d. Spontaneous telephone dialogues (recorded on MD via a local interface)                  6,257    6,257        0
e. Simulated business negotiations                                                        25,485        0   25,485
f. Interviews/discussions/debates (broadcast)                                            100,250   25,144   75,106
g. (political) Discussions/debates/meetings (non-broadcast)                               34,126    9,009   25,117
h. Lessons recorded in the classroom                                                      36,064   10,103   25,961
i. Live (e.g. sports) commentaries (broadcast)                                            35,116   10,130   24,986
j. News reports/reportages (broadcast)                                                    32,744    7,679   25,065
k. News (broadcast)                                                                       32,689    7,305   25,384
l. Commentaries/columns/reviews (broadcast)                                               32,502    7,431   25,071
m. Ceremonious speeches/sermons                                                            7,077    1,893    5,184
n. Lectures/seminars                                                                      23,056    8,143   14,913
o. Read speech                                                                            44,144   44,144        0
Total                                                                                  1,006,193  337,933  668,260

 
