Part of speech tagging

The entire corpus was tagged for part-of-speech (POS) information. Within the project a tagset comprising 316 tags was defined. The tagset closely follows the practices of the authoritative Dutch grammar, the ANS (Haeseryn et al., 1997). It conforms to the EAGLES guidelines and is described in Van Eynde (2003; here available in .pdf format).

A tagger was used to assign to each word its most probable tag. The tagger output was then checked and, where necessary, corrected.

Below, the aim and motivation for the part-of-speech tagging are outlined. We also describe in general terms the protocol that was developed and the procedure that was adopted, and provide information about the file types and formats that are used. An overview is given of the data that are available in version 1.0 of the corpus. Finally, we refer to the frequency list that is available.

Read more about

aim and motivation
procedure
protocol
file types and formats
overview of available data
frequencies of tags

Aim and motivation

The enrichment of the corpus with POS information is one of the few forms of annotation that are available for the entire corpus. The addition of POS tags makes it possible to investigate the use of words (i.e. orthographic words, or rather tokens). While many word forms are ambiguous in isolation, this is seldom the case when a word form is used in context. Consider, for example, the word form werk, which can be a noun or a verb. In the context het was zwaar werk it is immediately apparent that the noun interpretation is the correct one, while in the sentence ik werk altijd hard the verb tag is required. The POS tagging makes it possible for the user to search not only for literal strings, but also for specific parts of speech and certain morpho-syntactic characteristics (e.g. gender, degree, number). It is also possible to look for subclasses of parts of speech, such as postnominal adjectives.
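The difference between searching for a literal string and searching for a word form with a given part of speech can be sketched as follows. This is only an illustration: the `Token` type, the `find` helper and the simplified main-category labels (N, WW, ADJ, BW, VNW) are assumptions made for the example and do not reproduce the full tags of the tagset or the corpus file formats.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    tag: str  # simplified main-category label, assumed for illustration


# The two example sentences from the text, hand-tagged for this sketch only.
sentences = [
    [Token("het", "VNW"), Token("was", "WW"), Token("zwaar", "ADJ"), Token("werk", "N")],
    [Token("ik", "VNW"), Token("werk", "WW"), Token("altijd", "BW"), Token("hard", "ADJ")],
]


def find(sentences, word, pos=None):
    """Return (sentence index, token index) pairs for matching tokens.

    With pos=None the search is a plain string search; with a pos label
    only occurrences carrying that label are returned.
    """
    hits = []
    for si, sentence in enumerate(sentences):
        for ti, token in enumerate(sentence):
            if token.word == word and (pos is None or token.tag == pos):
                hits.append((si, ti))
    return hits
```

A plain string search for werk returns both occurrences, whereas restricting the search to the noun label returns only the occurrence in het was zwaar werk.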

The tagset distinguishes the ten major parts of speech that are usually distinguished for Dutch (see e.g. the Algemene Nederlandse Spraakkunst; Haeseryn et al., 1997). Because it encodes a great amount of detail, the tagset comprises 316 different tags.

Reference

Haeseryn, W., K. Romijn, G. Geerts, J. de Rooij & M. van den Toorn. 1997. Algemene Nederlandse Spraakkunst. Groningen: Nijhoff and Deurne: Wolters Plantyn.

Procedure

For the assignment of POS tags the following principles were adopted:

the word-by-word principle: tags were assigned to each token separately. This principle also applied in cases where the token was part of a multi-word expression. Thus ter and zake in the multi-word expression ter zake are tagged as two independent tokens. The same principle was applied to words that usually only occur as part of a multi-word expression (e.g. nervosa, which normally only occurs in combination with anorexia in anorexia nervosa). Compound words were tagged as single words: e.g. daarlangs received a single tag (adverb) instead of being analysed as an adverbial pronoun followed by a postposition.
the form-before-function principle: according to this principle formal and morphological criteria take precedence over functional or semantic criteria. Thus, for example, maandag was always tagged as a noun, even in cases where it occurred in adverbial position (as in maandag sta ik niet graag op).
the principle of obligatory contextual disambiguation: according to this principle words that are potentially ambiguous are assigned a single tag, viz. the tag that is appropriate in the given context.

In order to speed up and facilitate the annotation process a POS tagging system was used that had been developed by the University of Tilburg. The tagging system makes use of a combination of four different taggers: the TnT tagger, a Brill tagger, a maximum entropy tagger and a memory-based tagger. For further details, see Van Eynde et al. (2000).
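How the outputs of several taggers might be merged into one tag per token can be sketched as follows. The text above does not say how the four taggers were combined, so the simple majority vote used here (with ties going to the tagger listed first) is purely an assumption for illustration and should not be read as the method of the actual tagging system; see Van Eynde et al. (2000) for that. The function names and the simplified tag labels are likewise hypothetical.

```python
from collections import Counter


def combine_by_vote(proposals):
    """Pick the tag proposed most often for one token.

    Ties are resolved in favour of the tagger listed first
    (an assumption of this sketch).
    """
    counts = Counter(proposals)
    best = max(counts.values())
    for tag in proposals:
        if counts[tag] == best:
            return tag


def combine_sentence(tagger_outputs):
    """Combine several tag sequences of equal length, token by token."""
    return [combine_by_vote(list(column)) for column in zip(*tagger_outputs)]


# Hypothetical outputs of four taggers for: ik werk altijd hard
outputs = [
    ["VNW", "WW", "BW", "ADJ"],
    ["VNW", "N", "BW", "ADJ"],
    ["VNW", "WW", "BW", "BW"],
    ["VNW", "WW", "BW", "ADJ"],
]
combined = combine_sentence(outputs)
```

In this made-up example two taggers each disagree with the others on one token, and the vote settles on the tag proposed by the remaining three.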

All output was manually verified. For the verification a tag selection program was used that had been developed at the University of Nijmegen. This program allows the people checking the output to view the tagger output and to correct it by selecting the correct tag from a menu, after which it automatically replaces the erroneous tag. The use of the tag selection program prevented the introduction of non-existent tags, typos, etc., and also sped up the process of checking and correcting.
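The idea of restricting corrections to tags that actually exist can be sketched as a check against a closed tag inventory. This is not the tag selection program itself: the `correct_tag` function is hypothetical, and `VALID_TAGS` is a small placeholder set standing in for the 316 tags of the real tagset.

```python
# Placeholder inventory; the real tagset has 316 tags.
VALID_TAGS = {"N", "WW", "ADJ", "BW", "VNW"}


def correct_tag(tagged, index, new_tag, valid_tags=VALID_TAGS):
    """Return a copy of `tagged` with the tag at `index` replaced.

    `tagged` is a list of (word, tag) pairs. A tag outside the
    inventory is rejected, so typos cannot enter the annotation.
    """
    if new_tag not in valid_tags:
        raise ValueError(f"unknown tag: {new_tag!r}")
    updated = list(tagged)
    word, _old_tag = updated[index]
    updated[index] = (word, new_tag)
    return updated


# Tagger output with an error on "werk", corrected by the annotator.
tagger_output = [("ik", "VNW"), ("werk", "N"), ("altijd", "BW"), ("hard", "ADJ")]
corrected = correct_tag(tagger_output, 1, "WW")
```

Because the function returns a new list, the original tagger output remains available for comparison with the corrected version.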

Reference

Van Eynde, F., J. Zavrel & W. Daelemans. 2000. Part-of-Speech Tagging and Lemmatization for the Spoken Dutch Corpus. In M. Gavrilidou et al. (eds.), Proceedings of the Second International Conference on Language Resources and Evaluation. 1427-1433. Athens.

Protocol

The CGN tagset and guidelines for its use have been documented in the protocol:

Van Eynde, F. 2003. Protocol voor POS tagging en lemmatisering. (Here available in .pdf format; Dutch only.)

File types and formats

The POS tagging together with the lemmatisation has been stored in the following files:

files of type .plk. The format of these files is ASCII. These files can be found in the directory /data/annot/text/plk/ of the annotation DVD.
files of type .tag. The format of these files is XML. These files can be found in the directory /data/annot/xml/tag/ of the annotation DVD.

For the formats mentioned above, see also the descriptions of the plk format and the tag format.

Overview of available data

In Table 1 an overview is given of the data that are available in version 1.0 of the corpus. For a description of the design of the corpus and its motivation, we refer you to the description of the corpus design.

Table 1. Overview of available data (VL = data originating from Flanders, NL = data originating from the Netherlands)

Component	Total number of words	VL	NL
a.	Spontaneous conversations ('face-to-face')	2,626,172	878,383	1,747,789
b.	Interviews with teachers of Dutch	565,433	315,554	249,879
c.	Spontaneous telephone dialogues (recorded via a switchboard)	1,208,633	465,096	743,537
d.	Spontaneous telephone dialogues (recorded on MD via a local interface)	853,371	343,167	510,204
e.	Simulated business negotiations	136,461	0	136,461
f.	Interviews/discussions/debates (broadcast)	790,269	250,708	539,561
g.	(political) Discussions/debates/meetings (non-broadcast)	360,328	138,819	221,509
h.	Lessons recorded in the classroom	405,409	105,436	299,973
i.	Live (e.g. sports) commentaries (broadcast)	208,399	78,022	130,377
j.	Newsreports/reportages (broadcast)	186,072	95,206	90,866
k.	News (broadcast)	368,153	82,855	285,298
l.	Commentaries/columns/reviews (broadcast)	145,553	65,386	80,167
m.	Ceremonious speeches/sermons	18,075	12,510	5,565
n.	Lectures/seminars	140,901	79,067	61,834
o.	Read speech	903,043	351,419	551,624
Total	8,916,272	3,261,628	5,654,644
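The figures in Table 1 can be checked for internal consistency with a few lines of code. The sketch below copies the word counts from the table and tests that, for each component, the VL and NL counts add up to the stated total, and that the columns add up to the Total row. The names `ROWS`, `GRAND_TOTAL`, `inconsistent_rows` and `column_sums` are introduced here for illustration only.

```python
# Word counts copied from Table 1: component -> (total, VL, NL).
ROWS = {
    "a": (2_626_172, 878_383, 1_747_789),
    "b": (565_433, 315_554, 249_879),
    "c": (1_208_633, 465_096, 743_537),
    "d": (853_371, 343_167, 510_204),
    "e": (136_461, 0, 136_461),
    "f": (790_269, 250_708, 539_561),
    "g": (360_328, 138_819, 221_509),
    "h": (405_409, 105_436, 299_973),
    "i": (208_399, 78_022, 130_377),
    "j": (186_072, 95_206, 90_866),
    "k": (368_153, 82_855, 285_298),
    "l": (145_553, 65_386, 80_167),
    "m": (18_075, 12_510, 5_565),
    "n": (140_901, 79_067, 61_834),
    "o": (903_043, 351_419, 551_624),
}
# The Total row of Table 1: (total, VL, NL).
GRAND_TOTAL = (8_916_272, 3_261_628, 5_654_644)


def inconsistent_rows(rows):
    """Return the components whose VL + NL differs from the stated total."""
    return [name for name, (total, vl, nl) in rows.items() if vl + nl != total]


def column_sums(rows):
    """Sum each of the three columns over all components."""
    return tuple(sum(column) for column in zip(*rows.values()))
```

For the figures as given in Table 1, no component is reported as inconsistent and the column sums equal the Total row.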

Frequencies of tags

On the basis of the POS tagging of the corpus an alphabetical frequency list was derived, which gives the frequency with which POS tags are used with specific words. This frequency list (tagalph.frq) can be found in the directory /data/lexicon/freqlists/ of the annotation DVD. A description of the way this list is structured can be found at ../../lexicon/freq_lst.htm
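How such a list can in principle be derived from tagged tokens is sketched below: count each combination of word and tag, then sort the result alphabetically. The function `tag_frequencies` and the simplified tag labels are assumptions for this example, and the output does not reproduce the actual layout of tagalph.frq, for which the description referred to above should be consulted.

```python
from collections import Counter


def tag_frequencies(tokens):
    """Count (word, tag) pairs and return them sorted by word, then tag.

    `tokens` is an iterable of (word, tag) pairs; each result entry
    is a (word, tag, frequency) triple.
    """
    counts = Counter(tokens)
    return sorted((word, tag, n) for (word, tag), n in counts.items())


# A few hand-made tokens, for illustration only.
tokens = [("werk", "N"), ("werk", "WW"), ("werk", "N"), ("hard", "ADJ")]
frequencies = tag_frequencies(tokens)
```

Keeping one entry per word and tag combination, rather than per word, is what allows the list to show how often an ambiguous form such as werk occurs as a noun and how often as a verb.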