The .tag format

Files of type .tag (part-of-speech tagging, lemmatisation and lexicon link-up) are derived from files of type .plk. They give a chronological representation of this annotation type in an XML text format, whose structure is described by ptext.dtd on the annotation DVD.


<?xml version="1.0"?>
<!DOCTYPE ptext SYSTEM "ptext.dtd">
<ptext ref="fn123456">
 <pau ref="fn123456.1" s="N01036">
  <pw ref="fn123456.1.1"  w="ga"          pos="WW(pv,tgw,ev)" lem="gaan"
    wid="93037" lid="30559" nlid="30559#1" pq="man"/>
  <pw ref="fn123456.1.2"  w="je"          pos="VNW(pers,pron,nomin,red,2v,ev)" lem="je"
    wid="620014" lid="135108" nlid="135108#1" pq="man"/>
  <pw ref="fn123456.1.3"  w="nou"         pos="BW()" lem="nou"
    wid="620167" lid="135232" nlid="135232#1" pq="man"/>
  <pw ref="fn123456.1.4"  w="met"         pos="VZ(init)" lem="met"
    wid="620087" lid="135170" nlid="135170#1" pq="man"/>
  <pw ref="fn123456.1.5"  w="de"          pos="LID(bep,stan,rest)" lem="de"
    wid="619612" lid="134796" nlid="134796#1" pq="man"/>
  <pw ref="fn123456.1.6"  w="trein"       pos="N(soort,ev,basis,zijd,stan)" lem="trein"
    wid="317006" lid="104897" nlid="104897#1" pq="man"/>
  <pw ref="fn123456.1.7"  w="naar"        pos="VZ(init)" lem="naar"
    wid="620133" lid="135200" nlid="135200#1" pq="man"/>
  <pw ref="fn123456.1.8"  w="Loon"        pos="SPEC(deeleigen)" lem="_"
    wid="0" lid="0" nlid="608839#3" pq="man"/>
  <pw ref="fn123456.1.9"  w="Op"          pos="SPEC(deeleigen)" lem="_"
    wid="0" lid="0" nlid="608839#3" pq="man"/>
  <pw ref="fn123456.1.10" w="Zand"        pos="SPEC(deeleigen)" lem="_"
    wid="0" lid="0" nlid="608839#3" pq="man"/>
  <pw ref="fn123456.1.11" w="of"          pos="VG(neven)" lem="of"
    wid="620170" lid="135234" nlid="135234#1" pq="man"/>
  <pw ref="fn123456.1.12" w="met"         pos="VZ(init)" lem="met"
    wid="620087" lid="135170" nlid="135170#1" pq="man"/>
  <pw ref="fn123456.1.13" w="de"          pos="LID(bep,stan,rest)" lem="de"
    wid="619612" lid="134796" nlid="134796#1" pq="man"/>
  <pw ref="fn123456.1.14" w="bus"         pos="N(soort,ev,basis,zijd,stan)" lem="bus"
    wid="54520|54521" lid="16763|16764" nlid="16763|16764#1" pq="man"/>
  <pl ref="fn123456.1.15" w="?"           pos="LET()" lem="?"
    wid="0" lid="0" nlid="0#1" pq="man"/>
 </pau>
 <pau ref="fn123456.2" s="N01265">
  <pw ref="fn123456.2.1"   w="ja"         pos="TSW()" lem="ja"
    wid="141336" lid="45366" nlid="45366#1" pq="man"/>
  <pw ref="fn123456.2.2"   w="Partij"     pos="SPEC(deeleigen)" lem="_"
    wid="0" lid="0" nlid="610975#4" pq="man"/>
  <pw ref="fn123456.2.3"   w="Van"        pos="SPEC(deeleigen)" lem="_"
    wid="0" lid="0" nlid="610975#4" pq="man"/>
  <pw ref="fn123456.2.4"   w="De"         pos="SPEC(deeleigen)" lem="_"
    wid="0" lid="0" nlid="610975#4" pq="man"/>
  <pw ref="fn123456.2.5"   w="Arbeid"     pos="SPEC(deeleigen)" lem="_"
    wid="0" lid="0" nlid="610975#4" pq="man"/>
  <pw ref="fn123456.2.6"   w="is"         pos="WW(pv,tgw,ev)" lem="zijn"
    wid="141101" lid="122511" nlid="122511#1" pq="man"/>
  <pw ref="fn123456.2.7"   w="iets"       pos="VNW(onbep,pron,stan,vol,3o,ev)" lem="iets"
    wid="619991" lid="135089" nlid="135089#1" pq="man"/>
  <pw ref="fn123456.2.8"   w="vooruit"    pos="BW()" lem="vooruit"
    wid="620510" lid="135518" nlid="504346#2" pq="man"/>
  <pw ref="fn123456.2.9"   w="gegaan"     pos="WW(vd,vrij,zonder)" lem="gaan"
    wid="98566" lid="30559" nlid="500431#2" pq="man"/>
  <pw ref="fn123456.2.10"  w="'t"         pos="LID(bep,stan,evon)" lem="het"
    wid="619904" lid="135669" nlid="135669#1" pq="man"/>
  <pw ref="fn123456.2.11"  w="CDA"        pos="N(eigen,ev,basis,onz,stan)" lem="CDA"
    wid="381902" lid="125724" nlid="125724#1" pq="man"/>
  <pw ref="fn123456.2.12"  w="iets"       pos="VNW(onbep,pron,stan,vol,3o,ev)" lem="iets"
    wid="619991" lid="135089" nlid="135089#1" pq="man"/>
  <pw ref="fn123456.2.13"  w="achteruit"  pos="BW()" lem="achteruit"
    wid="619374" lid="134626" nlid="500431#2" pq="man"/>
  <pw ref="fn123456.2.14"  w="SP"         pos="N(eigen,ev,basis,zijd,stan)" lem="SP"
    wid="393723" lid="132419" nlid="132419#1" pq="man"/>
  <pw ref="fn123456.2.15"  w="verdubbeld" pos="WW(vd,vrij,zonder)" lem="verdubbelen"
    wid="333336" lid="109296" nlid="109296#1" pq="man"/>
  <pl ref="fn123456.2.16"  w="."          pos="LET()" lem="."
    wid="0" lid="0" nlid="0#1" pq="man"/>
 </pau>
</ptext>
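
As a rough illustration of how such a file might be read, the sketch below parses a shortened copy of the sample with Python's standard xml.etree.ElementTree module. The function name read_units and the shortened sample are assumptions made for this example; they are not part of the corpus tools. The DOCTYPE line is left out because ElementTree does not read the external ptext.dtd.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample above (hypothetical excerpt, DOCTYPE omitted).
SAMPLE = """<?xml version="1.0"?>
<ptext ref="fn123456">
 <pau ref="fn123456.1" s="N01036">
  <pw ref="fn123456.1.1" w="ga" pos="WW(pv,tgw,ev)" lem="gaan"
    wid="93037" lid="30559" nlid="30559#1" pq="man"/>
  <pw ref="fn123456.1.2" w="je" pos="VNW(pers,pron,nomin,red,2v,ev)" lem="je"
    wid="620014" lid="135108" nlid="135108#1" pq="man"/>
  <pl ref="fn123456.1.15" w="?" pos="LET()" lem="?"
    wid="0" lid="0" nlid="0#1" pq="man"/>
 </pau>
</ptext>
"""


def read_units(xml_text):
    """Return one (speaker, tokens) pair per <pau>, in document order."""
    root = ET.fromstring(xml_text)
    units = []
    for pau in root.iter("pau"):
        # Direct children are <pw> (words) and <pl> (punctuation); keep both.
        tokens = [(el.tag, el.get("w"), el.get("pos"), el.get("lem"))
                  for el in pau if el.tag in ("pw", "pl")]
        units.append((pau.get("s"), tokens))
    return units


units = read_units(SAMPLE)
for speaker, tokens in units:
    print(speaker, " ".join(w for _, w, _, _ in tokens))
```

Keeping the element name in each token makes it easy to tell words from the closing punctuation mark later on.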

<ptext> text with part-of-speech tagging, lemmatisation and lexicon link-up.
<pau> an annotation unit. The boundaries of this element are determined by the punctuation mark.
<pw> a word within an annotation unit (<pau>).
<pl> the punctuation mark within the annotation unit (<pau>). There are three possible values for this element: ".", "..." or "?".
<pmu> a mark-up unit that may comprise COMMENT or BACKGROUND information.
<pm> a marker within the mark-up unit (<pmu>).
ref The identification code is composed of one, two, or three parts (depending on the element with which it is associated) separated by a full stop. The meaning is as follows:
<sample number>.<annotation unit, rank number>.<word/marker/punctuation mark, rank number>
s speaker ID. In the context of the <pau> element, possible values of this attribute are Nxxxxx, Vxxxxx or UNKNOWN, where x denotes a digit. In the context of the <pmu> element, the s attribute has one of two possible values: COMMENT or BACKGROUND.
w word form as it occurs in the orthographic transcription (cf. data in the .ort files)
pos part-of-speech tag that has been assigned to the word form
lem lemma of the word form. The underscore "_" indicates that a lemma is missing
wid Lexicon ID of the word form. The ID refers to the single word lexicon (/data/lexicon/text/cgnlex.txt on the annotation DVD). wid="0" is used when no corresponding lemma occurs in the lexicon. If there is more than one reference to the lexicon, so that the word form is ambiguous, the lexicon IDs are separated by a vertical bar ("|"), e.g. wid="54520|54521".
lid Lexicon ID of the lemma of the word form. The ID refers to the single word lexicon (/data/lexicon/text/cgnlex.txt on the annotation DVD). lid="0" is used when no corresponding lemma occurs in the lexicon. If there is more than one reference to the lexicon, so that the word form is ambiguous, the lexicon IDs are separated by a vertical bar ("|"), e.g. lid="16763|16764".
nlid Lexicon ID of the multi-word lemma, followed by a hash ("#"), followed by the number of items in the multi-word expression. The ID refers to the multi-word lexicon (/data/lexicon/text/cgnmlex.txt on the annotation DVD). If the word is not part of a multi-word expression, the number following the hash is "1" (e.g. nlid="122511#1"). Multiple references to possible multi-word lemmas in the lexicon are separated by a vertical bar ("|") (e.g. nlid="16763|16764#1"). nlid="0" is used if there is no corresponding multi-word lemma in the lexicon.
pq quality of the part-of-speech tag (pos). It has one of two possible values:
man (manually verified): POS tag provided and/or checked by a human.
auto (automatically generated): POS tag provided by a machine and not verified. 
marked translates the * coding in the original orthographic transcription (.ort format) into an optional attribute of the <pw> element. Possible values are: foreign, dialect, incomplete, mispr, regionalpr and uncertain (corresponding to *v, *d, *a, *u, *z and *x respectively).
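
The composite attribute values described above (ref, wid, lid and nlid) can be taken apart with a few string operations. The following is a minimal sketch in Python; the helper names split_ref, split_ids and split_nlid are hypothetical, and returning None for a missing part is an assumption of this example rather than something the format prescribes.

```python
def split_ref(ref):
    """Split a ref into (sample, unit_rank, item_rank); absent parts are None."""
    parts = ref.split(".")
    sample = parts[0]
    unit = int(parts[1]) if len(parts) > 1 else None
    item = int(parts[2]) if len(parts) > 2 else None
    return sample, unit, item


def split_ids(value):
    """Split a wid or lid value on "|"; "0" (no lexicon entry) gives []."""
    return [] if value == "0" else [int(x) for x in value.split("|")]


def split_nlid(value):
    """Split an nlid value into (ids, item_count).

    item_count is None when the value carries no "#" part (assumption).
    """
    ids, _, count = value.partition("#")
    return split_ids(ids), (int(count) if count else None)


print(split_ref("fn123456.1.10"))
print(split_ids("54520|54521"))
print(split_nlid("608839#3"))
```

Consecutive <pw> elements that share the same nlid with a count above 1, such as "Loon", "Op" and "Zand" in the sample, can then be grouped into one multi-word expression.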

All characters from the ISO-8859-1 character set used in the transcription that fall outside the 7-bit range have been translated according to the Character entity references for ISO 8859-1 characters. The (sub)set of special characters can be found in ptext.dtd on the annotation DVD. An overview of the various standards for this character (sub)set can be found in entities.htm.
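
A parser that does not load ptext.dtd will reject named entity references it does not know. One possible workaround, sketched below in Python, is to rewrite them as numeric character references before parsing. This assumes that the entity names declared in ptext.dtd coincide with the HTML Latin-1 names in Python's html.entities table, which should be checked against the DTD; the function name to_numeric_refs is hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities every XML parser already knows; leave these untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def to_numeric_refs(xml_text):
    """Replace known named entities (e.g. &eacute;) by numeric references."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # unknown or predefined: keep as written
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, xml_text)


# Hypothetical one-element fragment, used only to exercise the function.
fragment = '<pw w="caf&eacute;" lem="caf&eacute;"/>'
element = ET.fromstring(to_numeric_refs(fragment))
print(element.get("w"))
```

Numeric references are understood by any XML parser, so the rewritten text no longer depends on the DTD being available.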