Part of speech tagging

The entire corpus was tagged for part-of-speech (POS) information. Within the project a tagset comprising 316 tags was defined. The tagset closely follows the practices of the authoritative Dutch grammar, the ANS (Haeseryn et al., 1997). It conforms to the EAGLES guidelines and is described in Van Eynde (2003; available in .pdf format).

A tagger was used to assign to each word its most probable tag. The tagger output was then checked and, where necessary, corrected.

Below, the aim and motivation for the part-of-speech tagging are outlined. We also describe in general terms the protocol that was developed and the procedure that was adopted, and we provide information about the file types and formats that are used. An overview is given of the data that are available in version 1.0 of the corpus. Finally, we refer to the frequency list that is available.




Aim and motivation

The enrichment of the corpus with POS information is one of the few forms of annotation that are available for the entire corpus. The addition of POS tags makes it possible to investigate the use of words (i.e. orthographic words, or rather tokens). While many word forms are ambiguous in isolation, this is seldom the case when the word form is used in context. Consider, for example, the word form werk, which can be a noun or a verb. In the context het was zwaar werk it is immediately apparent that the noun interpretation is the correct one, while in the sentence ik werk altijd hard the verb tag is required. The POS tagging makes it possible for the user to search not only for literal strings, but also for specific parts of speech and certain morpho-syntactic characteristics (e.g. gender, degree, number). It is also possible to look for subclasses of parts of speech, such as postnominal adjectives.

The tagset distinguishes the ten major parts of speech that are usually distinguished for Dutch (see e.g. the Algemene Nederlandse Spraakkunst). Because it includes a great amount of detail, it comprises 316 different tags.
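The kind of search described above can be sketched in a few lines of Python. This is only an illustration: the Token type, the parse_tag and search functions, and the assumption that a tag is written as a major part of speech followed by a bracketed feature list are ours, and the tag strings are illustrative values that should be checked against the protocol.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    tag: str  # assumed shape: HEAD(feature,feature,...)


def parse_tag(tag: str) -> tuple[str, list[str]]:
    """Split e.g. 'WW(pv,tgw,ev)' into ('WW', ['pv', 'tgw', 'ev'])."""
    match = re.fullmatch(r"(\w+)\((.*)\)", tag)
    if match is None:
        raise ValueError(f"unrecognised tag: {tag!r}")
    head, features = match.groups()
    return head, [f for f in features.split(",") if f]


def search(tokens, word=None, pos=None, feature=None):
    """Return the tokens that satisfy every criterion that is not None."""
    hits = []
    for token in tokens:
        head, features = parse_tag(token.tag)
        if word is not None and token.word != word:
            continue
        if pos is not None and head != pos:
            continue
        if feature is not None and feature not in features:
            continue
        hits.append(token)
    return hits


# Illustrative tags for the two example sentences (assumed, not authoritative).
TOKENS = [
    Token("het", "VNW(pers,pron,stan,red,3,ev,onz)"),
    Token("was", "WW(pv,verl,ev)"),
    Token("zwaar", "ADJ(prenom,basis,zonder)"),
    Token("werk", "N(soort,ev,basis,onz,stan)"),
    Token("ik", "VNW(pers,pron,nomin,vol,1,ev)"),
    Token("werk", "WW(pv,tgw,ev)"),
    Token("altijd", "BW()"),
    Token("hard", "ADJ(vrij,basis,zonder)"),
]

noun_uses = search(TOKENS, word="werk", pos="N")
verb_uses = search(TOKENS, word="werk", pos="WW")
prenominal_adjectives = search(TOKENS, pos="ADJ", feature="prenom")
```

A search for the literal string werk returns both tokens, whereas adding the part of speech separates the noun from the verb. The feature argument covers searches for subclasses such as prenominal adjectives.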

Reference

Haeseryn, W., K. Romijn, G. Geerts, J. de Rooij & M. van den Toorn. 1997. Algemene Nederlandse Spraakkunst. Groningen: Nijhoff and Deurne: Wolters Plantyn.




Procedure

For the assignment of POS tags the following principles were adopted:

In order to speed up and facilitate the annotation process a POS tagging system was used that had been developed by the University of Tilburg. The tagging system makes use of a combination of four different taggers: the TnT tagger, a Brill tagger, a maximum entropy tagger and a memory-based tagger. For further details, see Van Eynde et al. (2000).

All output was manually verified. For the verification a tag selection program was used that had been developed at the University of Nijmegen. This program allows the people checking the output to view the tagger output and to correct it by selecting the correct tag from a menu, after which it automatically replaces the erroneous tag. The use of the tag selection program prevented the introduction of non-existent tags, typos, etc., and it also sped up the process of checking and correcting.
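How the outputs of the four taggers are combined is not specified here (see Van Eynde et al., 2000, for the actual system). Purely as an illustration of the general idea of combining taggers, the sketch below uses simple majority voting, which is our own stand-in and not necessarily the method used in the project. The function name combine_by_vote and the example tags are hypothetical.

```python
from collections import Counter


def combine_by_vote(proposals: list[str]) -> str:
    """Return the tag proposed most often.

    Ties are resolved in favour of the tagger listed first.
    Majority voting is an assumed stand-in for the real combination method.
    """
    if not proposals:
        raise ValueError("at least one proposal is required")
    counts = Counter(proposals)
    best = max(counts.values())
    for tag in proposals:
        if counts[tag] == best:
            return tag
    raise AssertionError("unreachable: some tag always has the best count")


# Hypothetical proposals from four taggers for 'werk' in 'ik werk altijd hard'.
proposals = [
    "WW(pv,tgw,ev)",
    "N(soort,ev,basis,onz,stan)",
    "WW(pv,tgw,ev)",
    "WW(pv,tgw,ev)",
]
chosen = combine_by_vote(proposals)
```

Whatever the combination method, the selected tag is only a proposal. As described above, all output was subsequently verified by hand.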

Reference

Van Eynde, F., J. Zavrel & W. Daelemans. 2000. Part-of-Speech Tagging and Lemmatization for the Spoken Dutch Corpus. In M. Gavrilidou et al. (eds.), Proceedings of the Second International Conference on Language Resources and Evaluation. 1427-1433. Athens.
 
 



Protocol

The CGN tagset and guidelines for its use have been documented in the protocol:

Van Eynde, F. 2003. Protocol voor POS tagging en lemmatisering. (Here available in .pdf format; Dutch only.)
 


File types and formats

The POS tagging together with the lemmatisation has been stored in the following files:

For the formats mentioned above, see also the descriptions of the plk format and the tag format.
 
 


Overview of available data

In Table 1 an overview is given of the data that are available in version 1.0 of the corpus. For a description of the design of the corpus and its motivation, we refer you to the description of the corpus design.
 

Table 1. Overview of available data (VL = data originating from Flanders, NL = data originating from the Netherlands)
 
 
Component                                                                  Total words         VL         NL
a. Spontaneous conversations ('face-to-face')                                2,626,172    878,383  1,747,789
b. Interviews with teachers of Dutch                                           565,433    315,554    249,879
c. Spontaneous telephone dialogues (recorded via a switchboard)              1,208,633    465,096    743,537
d. Spontaneous telephone dialogues (recorded on MD via a local interface)      853,371    343,167    510,204
e. Simulated business negotiations                                             136,461          0    136,461
f. Interviews/discussions/debates (broadcast)                                  790,269    250,708    539,561
g. (political) Discussions/debates/meetings (non-broadcast)                    360,328    138,819    221,509
h. Lessons recorded in the classroom                                           405,409    105,436    299,973
i. Live (e.g. sports) commentaries (broadcast)                                 208,399     78,022    130,377
j. News reports/reportages (broadcast)                                         186,072     95,206     90,866
k. News (broadcast)                                                            368,153     82,855    285,298
l. Commentaries/columns/reviews (broadcast)                                    145,553     65,386     80,167
m. Ceremonious speeches/sermons                                                 18,075     12,510      5,565
n. Lectures/seminars                                                           140,901     79,067     61,834
o. Read speech                                                                 903,043    351,419    551,624
Total                                                                        8,916,272  3,261,628  5,654,644
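
The figures in Table 1 are internally consistent: for each component the VL and NL counts add up to the component total, and the columns add up to the Total row. The short Python check below reproduces this arithmetic. The names ROWS, REPORTED_TOTAL and check are ours and purely illustrative; the numbers are copied from the table.

```python
# (total, VL, NL) per component, copied from Table 1.
ROWS = {
    "a": (2_626_172, 878_383, 1_747_789),
    "b": (565_433, 315_554, 249_879),
    "c": (1_208_633, 465_096, 743_537),
    "d": (853_371, 343_167, 510_204),
    "e": (136_461, 0, 136_461),
    "f": (790_269, 250_708, 539_561),
    "g": (360_328, 138_819, 221_509),
    "h": (405_409, 105_436, 299_973),
    "i": (208_399, 78_022, 130_377),
    "j": (186_072, 95_206, 90_866),
    "k": (368_153, 82_855, 285_298),
    "l": (145_553, 65_386, 80_167),
    "m": (18_075, 12_510, 5_565),
    "n": (140_901, 79_067, 61_834),
    "o": (903_043, 351_419, 551_624),
}
REPORTED_TOTAL = (8_916_272, 3_261_628, 5_654_644)


def check(rows, reported):
    """Return the labels of rows (or 'Total') whose figures do not add up."""
    problems = []
    for component, (total, vl, nl) in rows.items():
        if vl + nl != total:
            problems.append(component)
    column_sums = tuple(sum(row[i] for row in rows.values()) for i in range(3))
    if column_sums != reported:
        problems.append("Total")
    return problems


problems = check(ROWS, REPORTED_TOTAL)
```

An empty list of problems means that every row and the Total row add up.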

 
 




Frequencies of tags

On the basis of the POS tagging of the corpus an alphabetical frequency list was derived which gives information about the frequency with which POS tags are used with specific words. This frequency list (tagalph.frq) can be found in the directory /data/lexicon/freqlists/ of the annotation DVD. A description of the way this list is structured can be found in ../../lexicon/freq_lst.htm
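
As a rough illustration of how such a list can be derived from tagged tokens, the sketch below counts each combination of word and tag and sorts the result alphabetically by word. The function name word_tag_frequencies, the shortened tags and the output layout are assumptions of ours; the actual structure of tagalph.frq is documented in the description referred to above and is not reproduced here.

```python
from collections import Counter


def word_tag_frequencies(tokens):
    """Count (word, tag) pairs and return (word, tag, count) rows.

    Rows are sorted alphabetically by word (ignoring case), then by tag.
    """
    counts = Counter(tokens)
    rows = [(word, tag, n) for (word, tag), n in counts.items()]
    rows.sort(key=lambda row: (row[0].lower(), row[1]))
    return rows


# Hypothetical tagged tokens; the tags are shortened for readability.
tagged = [
    ("werk", "N"),
    ("werk", "WW"),
    ("werk", "N"),
    ("altijd", "BW"),
]
frequency_rows = word_tag_frequencies(tagged)
```

Each row then states how often a given word form occurs with a given tag, which is the kind of information the frequency list provides.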
 
