The .plk format

Files of type .plk contain the part-of-speech tagging, the lemmatisation, the lexicon link-up and information about multi-word expressions.


<au id="1" s="N01036" tb="000.000">
ga         WW(pv,tgw,ev)    gaan        93037  30559
je         VNW(pers,...)    je          620014 135108
nou        BW()             nou         620167 135232
met        VZ(init)         met         620087 135170
de         LID(bep,...)     de          619612 134796
trein      N(soort,ev,...)  trein       317006 104897
naar       VZ(init)         naar        620133 135200
Loon       SPEC(deeleigen)  _           0      0      Loon_Op_Zand              608839        8,9,10
Op         SPEC(deeleigen)  _           0      0      Loon_Op_Zand              608839        8,9,10
Zand       SPEC(deeleigen)  _           0      0      Loon_Op_Zand              608839        8,9,10
of         VG(neven)        of          620170 135234
met        VZ(init)         met         620087 135170
de         LID(bep,...)     de          619612 134796
bus        N(soort,ev,...)  bus         54520|54521  16763|16764
?          LET()            ?           0      0
<mu id="2" s="BACKGROUND" tb="152.867">
inschenken SPEC(achter)     _           0      0
water.     SPEC(achter)     _           0      0
<au id="3" s="N01265" tb="175.824">
ja         TSW()            ja          141336 45366
Partij     SPEC(deeleigen)  _           0      0      Partij_Van_De_Arbeid      610975        2,3,4,5
Van        SPEC(deeleigen)  _           0      0      Partij_Van_De_Arbeid      610975        2,3,4,5
De         SPEC(deeleigen)  _           0      0      Partij_Van_De_Arbeid      610975        2,3,4,5
Arbeid     SPEC(deeleigen)  _           0      0      Partij_Van_De_Arbeid      610975        2,3,4,5
is         WW(pv,tgw,ev)    zijn        141101 122511
iets       VNW(onbep,...)   iets        619991 135089
vooruit    BW()             vooruit     620510 135518 vooruitgaan               504346        8,9
gegaan     WW(vd,vrij,...)  gaan        98566  30559  vooruitgaan/achteruitgaan 504346/500431 8,9/9,13
't         LID(bep,...)     het         619904 135669
CDA        N(eigen,ev,...)  CDA         381902 125724
iets       VNW(onbep,...)   iets        619991 135089
achteruit  BW()             achteruit   619374 134626 achteruitgaan             500431        9,13
SP         N(eigen,ev,...)  SP          393723 132419
verdubbeld WW(vd,...)       verdubbelen 333336 109296
.          LET()            .           0      0

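In a .plk file two types of lines occur: header lines, which open an <au> or <mu> unit, and token lines, which describe one word form each. The elements, attributes and columns are: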

<au> an annotation unit. The boundaries of this element are determined by punctuation marks.
<mu> a mark-up unit that may contain COMMENT or BACKGROUND information.
s speaker ID. In an <au> element the possible values of this attribute are Nxxxxx, Vxxxxx or UNKNOWN, where x denotes a digit. In a <mu> element the s attribute has one of two values: COMMENT or BACKGROUND.
tb time begin (in seconds) of the annotation unit. The time begin has been derived from the .ort file. A time marker may coincide with a sentence boundary, but this need not be the case. Therefore, the time begin may be somewhat earlier than the actual beginning of the sentence in the audio file.
column1 word form (token) as it occurs in the orthographic transcription (cf. data in the .ort files)
column2 part-of-speech tag that has been assigned to a token. For an overview of the tags that were used, see /data/annot/text/plk/tagset.txt on the annotation DVD.
column3 lemma of the token. The underscore ("_") indicates that a lemma is absent.
column4 lexicon-ID of the word form. The ID refers to the single-word lexicon (/data/lexicon/text/cgnlex.txt on the annotation DVD)
column5 lexicon-ID of the lemma associated with the word form. The ID refers to the single-word lexicon (/data/lexicon/text/cgnlex.txt on the annotation DVD)
column6 multi-word lemma (when different from column 3)
column7 lexicon-ID of the multi-word lemma. The ID refers to the multi-word lexicon (/data/lexicon/text/cgnmlex.txt on the annotation DVD)
column8 References to the different parts of the multi-word expression by means of the rank number of the word in the sentence.
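
The layout described above can be read with a few lines of Python. This is only a minimal sketch, not a tool that ships with the corpus: the names Token, Unit and parse_plk are hypothetical, and the code assumes that columns are separated by whitespace, that no column contains a space, and that every token line has either five or eight columns, as in the sample listing.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Header lines look like: <au id="1" s="N01036" tb="000.000">
HEADER_RE = re.compile(r'^<(au|mu)\s+id="([^"]*)"\s+s="([^"]*)"\s+tb="([^"]*)">$')


@dataclass
class Token:
    word: str                       # column1: word form
    tag: str                        # column2: part-of-speech tag
    lemma: str                      # column3: lemma ("_" if absent)
    wordform_id: str                # column4: lexicon-ID of the word form
    lemma_id: str                   # column5: lexicon-ID of the lemma
    mw_lemma: Optional[str] = None  # column6: multi-word lemma
    mw_id: Optional[str] = None     # column7: lexicon-ID of the multi-word lemma
    mw_refs: Optional[str] = None   # column8: rank numbers of the parts


@dataclass
class Unit:
    kind: str          # "au" or "mu"
    unit_id: str       # value of the id attribute
    speaker: str       # value of the s attribute
    time_begin: float  # value of the tb attribute, in seconds
    tokens: List[Token] = field(default_factory=list)


def parse_plk(text: str) -> List[Unit]:
    """Group token lines under the <au>/<mu> header that precedes them."""
    units: List[Unit] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = HEADER_RE.match(line)
        if header:
            kind, unit_id, speaker, tb = header.groups()
            units.append(Unit(kind, unit_id, speaker, float(tb)))
            continue
        if not units:
            raise ValueError(f"token line before any header: {line!r}")
        cols = line.split()
        if len(cols) not in (5, 8):
            raise ValueError(f"expected 5 or 8 columns, got {len(cols)}: {line!r}")
        units[-1].tokens.append(Token(*cols))
    return units
```

The ID columns are kept as strings on purpose, because they may contain the separators "|" and "/" and therefore cannot always be converted to a single integer.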

A lexicon-ID with the value "0" signifies that the lemma or the word form has not been linked up with the lexicon (i.e. is not considered to be part of a multi-word expression). Whenever an element of a multi-word expression has been omitted, as in ik deed (aandoen en uitdoen) het licht aan en uit ("I switched the light on and off"), the lemmas that occur with the word form deed are separated by a forward slash ("/"); the same goes for the associated lexicon-IDs in the next column. When a lemma or word form is ambiguous (there are multiple references to the lexicon), the lexicon-IDs are separated by a vertical bar ("|").
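
The separator conventions can be illustrated with a short Python sketch. The function names is_linked, split_ambiguous and multiword_entries are hypothetical. The sketch assumes that column 8 uses the same "/" separator in parallel with columns 6 and 7, as the sample line for gegaan suggests; the description itself only states this for the lemmas and the lexicon-IDs.

```python
from typing import List, Tuple


def is_linked(lexicon_id: str) -> bool:
    """A lexicon-ID of "0" means there is no link to the lexicon."""
    return lexicon_id != "0"


def split_ambiguous(lexicon_id: str) -> List[str]:
    """Split an ambiguous ID field such as "54520|54521" into its readings."""
    return lexicon_id.split("|")


def multiword_entries(lemmas: str, ids: str, refs: str) -> List[Tuple[str, str, List[int]]]:
    """Pair up the "/"-separated alternatives of columns 6, 7 and 8.

    Each result is (multi-word lemma, lexicon-ID, rank numbers of the parts).
    The rank numbers count words within the unit, starting at 1.
    """
    lemma_parts = lemmas.split("/")
    id_parts = ids.split("/")
    ref_parts = [[int(n) for n in group.split(",")] for group in refs.split("/")]
    if not (len(lemma_parts) == len(id_parts) == len(ref_parts)):
        raise ValueError("columns 6, 7 and 8 list a different number of alternatives")
    return list(zip(lemma_parts, id_parts, ref_parts))
```

For the sample line for gegaan, multiword_entries("vooruitgaan/achteruitgaan", "504346/500431", "8,9/9,13") pairs vooruitgaan with words 8 and 9 and achteruitgaan with words 9 and 13.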