The .pri format

Files of type .pri (primary data) are derived from files of type .ort. A .pri file is the chronological representation of the orthographic transcription in XML format. The structure of this XML format is described by text.dtd on the annotation DVD.


<?xml version="1.0"?>
<!DOCTYPE text SYSTEM "text.dtd">
<text id="fn123456">
  <mu id="fn123456.1" s="COMMENT">
    <m id="fn123456.1.1">                          De              </m>
    <m id="fn123456.1.2">                          televisie       </m>
    <m id="fn123456.1.3">                          staat           </m>
    <m id="fn123456.1.4">                          aan             </m>
    <m id="fn123456.1.5">                          op              </m>
    <m id="fn123456.1.6">                          de              </m>
    <m id="fn123456.1.7">                          achtergrond.    </m>
  </mu>
  <au id="fn123456.2" s="N01168">
    <w id="fn123456.2.1">                          maar            </w>
    <w id="fn123456.2.2">                          zij             </w>
    <w id="fn123456.2.3">                          gaat            </w>
    <w id="fn123456.2.4">                          uh              </w>
    <w id="fn123456.2.5">                          drankjes        </w>
    <w id="fn123456.2.6">                          verkopen        </w>
    <l id="fn123456.2.7">                          .               </l>
  </au>
  <au id="fn123456.3" s="N01167">
    <w id="fn123456.3.1">                          gratis          </w>
    <w id="fn123456.3.2">                          verkopen        </w>
    <l id="fn123456.3.3">                          ?               </l>
  </au>
  ...
  <au id="fn123456.4" s="N01168">
    <w id="fn123456.4.1">                          nou             </w>
    <w id="fn123456.4.2">                          ok&eacute;      </w>
    <l id="fn123456.4.3">                          .               </l>
  </au>
  <mu id="fn123456.5" s="BACKGROUND">
    <m id="fn123456.5.1">                          inschenken      </m>
    <m id="fn123456.5.2">                          water.          </m>
  </mu>
  <au id="fn123456.6" s="N01169">
    <w id="fn123456.6.1">                          dat             </w>
    <w id="fn123456.6.2">                          hoorde          </w>
    <w id="fn123456.6.3" marked="incomplete">      i               </w>
    <l id="fn123456.6.4">                          ...             </l>
  </au>
  ...
</text>
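
As an illustration of the structure shown above, the following is a minimal sketch of reading such a file with Python's standard library. The function name read_units and the shortened SAMPLE string are assumptions made for this example and are not part of the corpus tools. The sample deliberately contains no character entities, because the standard-library parser does not load text.dtd; entities such as &eacute; would have to be resolved separately.

```python
import xml.etree.ElementTree as ET

# Shortened sample modelled on the listing above (hypothetical data,
# without character entities so that it parses without text.dtd).
SAMPLE = """<?xml version="1.0"?>
<!DOCTYPE text SYSTEM "text.dtd">
<text id="fn123456">
  <mu id="fn123456.1" s="COMMENT">
    <m id="fn123456.1.1">      De          </m>
    <m id="fn123456.1.2">      televisie   </m>
  </mu>
  <au id="fn123456.2" s="N01168">
    <w id="fn123456.2.1">      maar        </w>
    <w id="fn123456.2.2">      zij         </w>
    <l id="fn123456.2.3">      .           </l>
  </au>
</text>
"""


def read_units(xml_text):
    """Return one (tag, s, tokens) tuple per <au> or <mu> element.

    tokens is a list of (child tag, stripped text) pairs; the padding
    whitespace around each token in the file is removed.
    """
    root = ET.fromstring(xml_text)
    units = []
    for unit in root:
        tokens = [(child.tag, (child.text or "").strip()) for child in unit]
        units.append((unit.tag, unit.get("s"), tokens))
    return units


units = read_units(SAMPLE)
for tag, speaker, tokens in units:
    print(tag, speaker, " ".join(text for _, text in tokens))
```

Stripping the token text is a choice made here because the tokens in the listing are padded with spaces for alignment.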

<text> the text. This is the root element; its id is the sample ID.
<au> an annotation unit. An annotation unit ends at its punctuation mark (<l>).
<w> a word within an annotation unit (<au>).
<l> the punctuation mark within an annotation unit (<au>). There are three possible values for this element: ".", "..." or "?".
<mu> a mark-up unit, which contains either COMMENT or BACKGROUND information.
<m> a marker within the mark-up unit (<mu>).
s speaker ID. In the context of the <au> element this attribute may have the following values: Nxxxxx, Vxxxxx or UNKNOWN, where x denotes a digit. In the context of the <mu> element there are two possible values for the s attribute: COMMENT or BACKGROUND.
id The identification code is composed of one, two or three parts (depending on the element to which it belongs) that are separated by a period. The information is as follows:
<sample ID>.<annotation unit rank number>.<word/marker/punctuation sign-rank number>
marked represents the *-codes of the original orthographic transcription (.ort format) as an optional attribute of the <w> element. Possible values are: foreign, dialect, incomplete, mispr, regionalpr and uncertain (corresponding to *v, *d, *a, *u, *z and *x respectively).
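
The id and marked attributes described above can be handled with a few lines of Python. This is only a sketch: the names split_id and MARKED_BY_CODE are hypothetical, and it assumes that the sample ID itself contains no period and that the rank numbers are plain integers, as in the listing.

```python
# Mapping from the *-codes of the .ort transcription to the values of
# the "marked" attribute, in the order given in the description.
MARKED_BY_CODE = {
    "*v": "foreign",
    "*d": "dialect",
    "*a": "incomplete",
    "*u": "mispr",
    "*z": "regionalpr",
    "*x": "uncertain",
}


def split_id(identifier):
    """Split an id into (sample ID, unit rank, token rank).

    Parts that are absent (e.g. for the <text> element) are returned
    as None. Assumes the sample ID contains no period.
    """
    parts = identifier.split(".")
    if not 1 <= len(parts) <= 3:
        raise ValueError("expected one to three parts: %r" % identifier)
    sample = parts[0]
    ranks = [int(p) for p in parts[1:]]
    ranks += [None] * (2 - len(ranks))  # pad missing rank numbers
    return (sample, ranks[0], ranks[1])


print(split_id("fn123456"))      # id of <text>
print(split_id("fn123456.6"))    # id of <au> or <mu>
print(split_id("fn123456.6.3"))  # id of <w>, <l> or <m>
print(MARKED_BY_CODE["*a"])
```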

All characters in the transcriptions that belong to the ISO 8859-1 character set but fall outside the 7-bit range have been converted to the character entity references for ISO 8859-1 characters. The subset of special characters that is actually used is listed in text.dtd. The file entities.htm gives an overview of the different standards for this character (sub)set.
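
To turn such entity references back into the original characters, something like the following Python sketch could be used. It assumes that the entity names declared in text.dtd are the standard ISO 8859-1 (Latin-1) names, which Python's html.unescape also recognises; the function name decode_entities is made up for this example.

```python
import html


def decode_entities(token):
    """Replace character entity references (e.g. &eacute;) by the
    characters they stand for.

    Assumption: the names used in text.dtd are the standard Latin-1
    entity names known to html.unescape.
    """
    return html.unescape(token)


print(decode_entities("ok&eacute;"))  # oké
```

Note that html.unescape also resolves other HTML entities and numeric references, so it accepts more than the subset listed in text.dtd.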