The entire corpus was tagged for part-of-speech (POS) information. For this purpose a tagset of 316 tags was defined within the project. The tagset closely follows the practices of the authoritative Dutch grammar, the ANS (Haeseryn et al., 1997). It conforms to the EAGLES guidelines and is described in Van Eynde (2003; available here in .pdf format).
A tagger was used to assign to each word its most probable tag. The tagger output was then checked and, where necessary, corrected.
Below, the aim and motivation of the part-of-speech tagging are outlined. We also describe in general terms the protocol that was developed and the procedure that was adopted, and provide information about the file types and formats that are used. An overview is given of the data that are available in version 1.0 of the corpus. Finally, we refer to the frequency list that is available.
The enrichment of the corpus with POS information is one of the few forms of annotation that are available for the entire corpus. The addition of POS tags makes it possible to investigate the use of words (i.e. orthographic words, or rather tokens). While many word forms are ambiguous in isolation, this is seldom the case when the word form is used in context. Consider, for example, the word form werk, which can be a noun or a verb. In the context het was zwaar werk it is immediately apparent that the noun interpretation is the correct one, while in the sentence ik werk altijd hard the verb tag is required. The POS tagging makes it possible for the user to search not only for literal strings, but also for specific parts of speech and certain morpho-syntactic characteristics (e.g. gender, degree, number). It is also possible to look for subclasses of parts of speech, such as postnominal adjectives.
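The difference between searching for a literal string and searching for a word form with a given part of speech can be sketched in Python. This is an illustration only: the representation of a sentence as a list of (word form, tag) pairs, the helper `find_by_pos`, and the tag strings used here are assumptions made for the example and say nothing about the actual file format of the corpus; the protocol remains the authority on the tags themselves.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, POS tag); hypothetical in-memory representation

# The two example sentences from the text, tagged by hand.
# The tag strings are placeholders written in the style of the tagset.
SENTENCES: List[List[Token]] = [
    [("het", "VNW"), ("was", "WW"), ("zwaar", "ADJ"),
     ("werk", "N(soort,ev,basis,onz,stan)")],
    [("ik", "VNW"), ("werk", "WW(pv,tgw,ev)"), ("altijd", "BW"), ("hard", "ADJ")],
]


def main_category(tag: str) -> str:
    """Return the part of a tag before the first bracket, e.g. 'N' or 'WW'."""
    return tag.split("(", 1)[0]


def find_by_pos(sentences: List[List[Token]], word: str, pos: str) -> List[int]:
    """Return the indices of the sentences in which `word` occurs with main category `pos`."""
    hits = []
    for i, sentence in enumerate(sentences):
        if any(w == word and main_category(t) == pos for w, t in sentence):
            hits.append(i)
    return hits


noun_hits = find_by_pos(SENTENCES, "werk", "N")   # sentences where werk is a noun
verb_hits = find_by_pos(SENTENCES, "werk", "WW")  # sentences where werk is a verb
```

A search on the literal string werk would return both sentences; restricting the search to one main category separates the noun reading from the verb reading.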
The tagset distinguishes the ten major parts of speech that are usually distinguished for Dutch (see e.g. Algemene Nederlandse Spraakkunst). Because it encodes a great amount of detail, it comprises 316 different tags.
Reference
Haeseryn, W., K. Romijn, G. Geerts, J. de Rooij & M. van den Toorn. 1997. Algemene Nederlandse Spraakkunst. Groningen: Nijhoff and Deurne: Wolters Plantyn.
For the assignment of POS tags the following principles were adopted:
All output was manually verified. For the verification a tag selection program was used that had been developed at the University of Nijmegen. With this program, the people checking the output could view the tagger output and correct it by selecting the correct tag from a menu, after which the erroneous tag was replaced automatically. The use of the tag selection program prevented non-existent tags, typing errors, etc. from being entered, and it also speeded up the process of checking and correcting.
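The idea behind menu-based correction, namely that a replacement tag can only come from the closed tagset, can be sketched as follows. This is not the Nijmegen tag selection program: the function `correct_tag`, the (word form, tag) representation, and the three-tag stand-in for the full set of 316 tags are all assumptions made for illustration.

```python
from typing import FrozenSet, List, Tuple

Token = Tuple[str, str]  # (word form, POS tag); hypothetical representation

# Tiny stand-in for the closed tagset; the real tagset has 316 tags.
VALID_TAGS: FrozenSet[str] = frozenset({
    "N(soort,ev,basis,onz,stan)",
    "WW(pv,tgw,ev)",
    "VNW",
})


def correct_tag(tokens: List[Token], index: int, new_tag: str,
                valid_tags: FrozenSet[str] = VALID_TAGS) -> List[Token]:
    """Return a copy of `tokens` with the tag at `index` replaced by `new_tag`.

    A tag that is not in the closed set is rejected, which is the effect
    that choosing from a menu has for a human corrector.
    """
    if new_tag not in valid_tags:
        raise ValueError(f"unknown tag: {new_tag!r}")
    corrected = list(tokens)
    word, _old_tag = corrected[index]
    corrected[index] = (word, new_tag)
    return corrected


# The tagger wrongly labelled werk as a noun in 'ik werk'; the checker fixes it.
tagger_output = [("ik", "VNW"), ("werk", "N(soort,ev,basis,onz,stan)")]
checked = correct_tag(tagger_output, 1, "WW(pv,tgw,ev)")
```

Because the function returns a new list, the original tagger output stays available for comparison with the corrected version.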
Reference
Van Eynde, F., J. Zavrel & W. Daelemans. 2000. Part-of-Speech Tagging and Lemmatization for the Spoken Dutch Corpus. In M. Gavrilidou et al. (eds.), Proceedings of the Second International Conference on Language Resources and Evaluation. 1427-1433. Athens.
The CGN tagset and guidelines for its use have been documented in the protocol:
Van Eynde, F. 2003. Protocol voor POS tagging en lemmatisering. (Available here in .pdf format; Dutch only.)
The POS tagging together with the lemmatisation has been stored in the following files:
Table 1 gives an overview of the data that are available in version 1.0 of the corpus. For a description of the design of the corpus and its motivation, we refer you to the description of the corpus design.
Table 1. Overview of available data (VL = data originating from Flanders, NL = data originating from the Netherlands)
| Component | Total number of words | VL | NL |
|---|---|---|---|
| a. Spontaneous conversations ('face-to-face') | 2,626,172 | 878,383 | 1,747,789 |
| b. Interviews with teachers of Dutch | 565,433 | 315,554 | 249,879 |
| c. Spontaneous telephone dialogues (recorded via a switchboard) | 1,208,633 | 465,096 | 743,537 |
| d. Spontaneous telephone dialogues (recorded on MD via a local interface) | 853,371 | 343,167 | 510,204 |
| e. Simulated business negotiations | 136,461 | 0 | 136,461 |
| f. Interviews/discussions/debates (broadcast) | 790,269 | 250,708 | 539,561 |
| g. (political) Discussions/debates/meetings (non-broadcast) | 360,328 | 138,819 | 221,509 |
| h. Lessons recorded in the classroom | 405,409 | 105,436 | 299,973 |
| i. Live (e.g. sports) commentaries (broadcast) | 208,399 | 78,022 | 130,377 |
| j. News reports/reportages (broadcast) | 186,072 | 95,206 | 90,866 |
| k. News (broadcast) | 368,153 | 82,855 | 285,298 |
| l. Commentaries/columns/reviews (broadcast) | 145,553 | 65,386 | 80,167 |
| m. Ceremonious speeches/sermons | 18,075 | 12,510 | 5,565 |
| n. Lectures/seminars | 140,901 | 79,067 | 61,834 |
| o. Read speech | 903,043 | 351,419 | 551,624 |
| Total | 8,916,272 | 3,261,628 | 5,654,644 |
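The figures in Table 1 are internally consistent: for every component the two regional counts add up to the component total, and the columns add up to the totals in the last row. A minimal Python check, with the numbers copied from the table and the helper name `inconsistent_rows` chosen here for illustration, could look like this:

```python
# (component, total, first regional count, second regional count), copied from Table 1
ROWS = [
    ("a", 2_626_172, 878_383, 1_747_789),
    ("b", 565_433, 315_554, 249_879),
    ("c", 1_208_633, 465_096, 743_537),
    ("d", 853_371, 343_167, 510_204),
    ("e", 136_461, 0, 136_461),
    ("f", 790_269, 250_708, 539_561),
    ("g", 360_328, 138_819, 221_509),
    ("h", 405_409, 105_436, 299_973),
    ("i", 208_399, 78_022, 130_377),
    ("j", 186_072, 95_206, 90_866),
    ("k", 368_153, 82_855, 285_298),
    ("l", 145_553, 65_386, 80_167),
    ("m", 18_075, 12_510, 5_565),
    ("n", 140_901, 79_067, 61_834),
    ("o", 903_043, 351_419, 551_624),
]
TOTAL_ROW = (8_916_272, 3_261_628, 5_654_644)  # last row of Table 1


def inconsistent_rows(rows):
    """Return the components whose regional counts do not add up to the row total."""
    return [name for name, total, first, second in rows if first + second != total]


# Column sums, to be compared with the Total row.
column_sums = tuple(sum(row[i] for row in ROWS) for i in (1, 2, 3))
bad_rows = inconsistent_rows(ROWS)
```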
On the basis of the POS tagging of the corpus, an alphabetical frequency list was derived which gives the frequency with which POS tags are used with specific words. This frequency list (tagalph.frq) can be found in the directory /data/lexicon/freqlists/ of the annotation DVD. A description of the way this list is structured can be found at ../../lexicon/freq_lst.htm
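
How such a list can be derived from tagged tokens is sketched below in Python. The sketch does not reproduce the layout of tagalph.frq, which is documented separately; the (word form, tag) input representation, the function name `tag_frequency_list`, and the placeholder tag strings are assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (word form, POS tag); hypothetical representation


def tag_frequency_list(tokens: Iterable[Token]) -> List[Tuple[str, str, int]]:
    """Count each (word form, tag) pair and return the counts sorted
    alphabetically by word form, then by tag."""
    counts = Counter(tokens)
    rows = [(word, tag, n) for (word, tag), n in counts.items()]
    return sorted(rows, key=lambda row: (row[0], row[1]))


# Toy input: werk occurs twice as a verb and once as a noun (placeholder tags).
tokens = [
    ("werk", "WW"), ("hard", "ADJ"), ("werk", "N"),
    ("werk", "WW"), ("altijd", "BW"),
]
freq_list = tag_frequency_list(tokens)
```

For an ambiguous word form such as werk, the resulting list contains one row per tag, so the relative frequency of the noun and verb readings can be read off directly.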