The .skp format

Files of type .skp (signal linking data) are a chronological representation of the orthographic transcription in an XML text format. The structure of this format is described by ttext.dtd on the annotation DVD. Apart from the transcription, this format contains time information. The skp files in the directory /data/annot/xml/skp-ort on the annotation DVD have been derived from files of type .ort. In addition, there are skp-wrd files (/data/annot/xml/skp-wrd on the annotation DVD) that have been derived from the manually verified word segmentation (files of type .wrd), and skp-auto files (/data/annot/xml/skp-auto, also on the annotation DVD) that have been derived from the automatic word segmentations (files of type .awd).


<?xml version="1.0"?>
<!DOCTYPE ttext SYSTEM "ttext.dtd">
<ttext ref="fn123456">
  <tmu ref="fn123456.1" s="COMMENT" tb="0.000" te="599.523" tt="eq" tq="man">
    <tm ref="fn123456.1.1"  tb="0.000" te="599.523" tt="in" tq="man"   m="De"/>
    <tm ref="fn123456.1.2"  tb="0.000" te="599.523" tt="in" tq="man"   m="televisie"/>
    <tm ref="fn123456.1.3"  tb="0.000" te="599.523" tt="in" tq="man"   m="staat"/>
    <tm ref="fn123456.1.4"  tb="0.000" te="599.523" tt="in" tq="man"   m="aan"/>
    <tm ref="fn123456.1.5"  tb="0.000" te="599.523" tt="in" tq="man"   m="op"/>
    <tm ref="fn123456.1.6"  tb="0.000" te="599.523" tt="in" tq="man"   m="de"/>
    <tm ref="fn123456.1.7"  tb="0.000" te="599.523" tt="in" tq="man"   m="achtergrond."/>
  </tmu>
  <tau ref="fn123456.2" s="N01168" tb="0.251" te="2.250" tt="eq" tq="man">
    <tw ref="fn123456.2.1"  tb="0.251" te="2.250" tt="in" tq="man"     w="maar"/>
    <tw ref="fn123456.2.2"  tb="0.251" te="2.250" tt="in" tq="man"     w="zij"/>
    <tw ref="fn123456.2.3"  tb="0.251" te="2.250" tt="in" tq="man"     w="gaat"/>
    <tw ref="fn123456.2.4"  tb="0.251" te="2.250" tt="in" tq="man"     w="uh"/>
    <tw ref="fn123456.2.5"  tb="0.251" te="2.250" tt="in" tq="man"     w="drankjes"/>
    <tw ref="fn123456.2.6"  tb="0.251" te="2.250" tt="in" tq="man"     w="verkopen"/>
  </tau>
  <tau ref="fn123456.3" s="N01167" tb="3.016" te="4.204" tt="eq" tq="man">
    <tw ref="fn123456.3.1"  tb="3.016" te="4.204" tt="in" tq="man"     w="gratis"/>
    <tw ref="fn123456.3.2"  tb="3.016" te="4.204" tt="in" tq="man"     w="verkopen"/>
  </tau>
  <tau ref="fn123456.4" s="N01168" tb="115.481" te="116.108" tt="eq" tq="man">
    <tw ref="fn123456.4.1" tb="115.481" te="116.108" tt="in" tq="man" w="nou"/>
    <tw ref="fn123456.4.2" tb="115.481" te="116.108" tt="in" tq="man" w="ok&eacute;"/>
  </tau>
  <tmu ref="fn123456.5" s="BACKGROUND" tb="122.867" te="126.395" tt="eq" tq="man">
    <tm ref="fn123456.5.1" tb="122.867" te="126.395" tt="in" tq="man" m="inschenken"/>
    <tm ref="fn123456.5.2" tb="122.867" te="126.395" tt="in" tq="man" m="water."/>
  </tmu>
  <tau ref="fn123456.6" s="N01169" tb="138.171" te="138.954" tt="eq" tq="man">
    <tw ref="fn123456.6.1" tb="138.171" te="138.954" tt="in" tq="man" w="dat"/>
    <tw ref="fn123456.6.2" tb="138.171" te="138.954" tt="in" tq="man" w="hoorde"/>
    <tw ref="fn123456.6.3" tb="138.171" te="138.954" tt="in" tq="man" w="i"/>
  </tau>
  ...
</ttext>
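
As a minimal sketch (not part of the corpus tooling), a fragment of this shape can be read with Python's standard xml.etree.ElementTree module. The embedded sample is a shortened copy of the example above with the DOCTYPE line left out, so that no DTD is needed. The function name read_units and the dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example above (DOCTYPE omitted, so no DTD is needed).
SAMPLE = """<?xml version="1.0"?>
<ttext ref="fn123456">
  <tau ref="fn123456.3" s="N01167" tb="3.016" te="4.204" tt="eq" tq="man">
    <tw ref="fn123456.3.1" tb="3.016" te="4.204" tt="in" tq="man" w="gratis"/>
    <tw ref="fn123456.3.2" tb="3.016" te="4.204" tt="in" tq="man" w="verkopen"/>
  </tau>
  <tmu ref="fn123456.5" s="BACKGROUND" tb="122.867" te="126.395" tt="eq" tq="man">
    <tm ref="fn123456.5.1" tb="122.867" te="126.395" tt="in" tq="man" m="inschenken"/>
    <tm ref="fn123456.5.2" tb="122.867" te="126.395" tt="in" tq="man" m="water."/>
  </tmu>
</ttext>
"""

def read_units(xml_text):
    """Return one dict per <tau>/<tmu> unit, in document order."""
    root = ET.fromstring(xml_text)
    units = []
    for unit in root:
        if unit.tag == "tau":
            # Words of an annotation unit are stored in the w attribute.
            tokens = [tw.get("w") for tw in unit.findall("tw")]
        elif unit.tag == "tmu":
            # Markers of a mark-up unit are stored in the m attribute.
            tokens = [tm.get("m") for tm in unit.findall("tm")]
        else:
            continue
        units.append({
            "kind": unit.tag,
            "ref": unit.get("ref"),
            "speaker": unit.get("s"),
            "begin": float(unit.get("tb")),
            "end": float(unit.get("te")),
            "tokens": tokens,
        })
    return units

units = read_units(SAMPLE)
for u in units:
    print(f'{u["begin"]:9.3f} {u["end"]:9.3f} {u["speaker"]:<10} {" ".join(u["tokens"])}')
```

Each unit keeps its own begin and end time. In this example the contained words and markers repeat the times of the enclosing unit.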

<ttext> a time aligned text.
<tau> a time aligned annotation unit. The boundaries of this element are determined by punctuation marks, which themselves are not included in this format.
<tw> a time aligned word within a time aligned annotation unit (<tau>).
<tmu> a time aligned mark-up unit that may contain COMMENT or BACKGROUND information.
<tm> a time aligned marker within the time aligned mark-up unit (<tmu>).
ref The reference code comprises one, two or three parts (depending on the element to which it belongs) separated by a period. The code should be interpreted as follows: <sample ID>.<t[am]u-rank number>.<t[wm]-rank number>
s speaker ID. In the context of the <tau> element the possible values for this attribute are: Nxxxxx, Vxxxxx or UNKNOWN, where x denotes a digit. In the context of the <tmu> element the s-attribute has one of two values: COMMENT or BACKGROUND.
w the orthographic transcription of the word.
m the orthographic transcription of a marker.
tb time begin (in seconds) of a time aligned annotation unit.
te time end (in seconds) of a time aligned annotation unit.
tt type of a time span. eq (equality) is used to indicate that the annotation unit coincides with the time span that is delimited by tb and te. in (inclusion) indicates that the unit falls within the time span.
tq quality of the time span. It has one of the following three values:
man (manually): time markers have been inserted manually.
auto (automatically): time markers have been inserted automatically and have not been manually verified.
auto_unrel (automatically, unreliable): time markers that have been generated automatically and which are known to be unreliable.
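
The conventions for ref, s, tt and tq described above can be checked mechanically. The following is a rough sketch only. The helper names split_ref, speaker_kind and check_time_attrs are hypothetical. The speaker pattern assumes exactly five digits after N or V, as suggested by the Nxxxxx notation, and the check that tb does not exceed te is likewise an assumption.

```python
import re

SPEAKER_RE = re.compile(r"[NV]\d{5}")    # assumption: exactly five digits
TT_VALUES = {"eq", "in"}
TQ_VALUES = {"man", "auto", "auto_unrel"}

def split_ref(ref):
    """Split a ref code into (sample ID, unit rank, token rank).

    Parts that are absent (on <ttext>, or on <tau>/<tmu>) come back as None.
    Assumes the sample ID itself contains no period.
    """
    parts = ref.split(".")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unexpected ref code: {ref!r}")
    ranks = [int(p) for p in parts[1:]]
    ranks += [None] * (2 - len(ranks))   # pad missing ranks with None
    return parts[0], ranks[0], ranks[1]

def speaker_kind(s):
    """Classify the value of an s attribute."""
    if SPEAKER_RE.fullmatch(s):
        return "speaker"
    if s in ("COMMENT", "BACKGROUND"):
        return "markup"
    return "other"

def check_time_attrs(tb, te, tt, tq):
    """Return True if the four timing attributes look consistent."""
    # Assumption: a begin time never exceeds the matching end time.
    return float(tb) <= float(te) and tt in TT_VALUES and tq in TQ_VALUES

print(split_ref("fn123456"))        # ('fn123456', None, None)
print(split_ref("fn123456.4.2"))    # ('fn123456', 4, 2)
print(speaker_kind("N01168"))       # speaker
print(check_time_attrs("115.481", "116.108", "in", "man"))  # True
```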

All characters in the transcriptions that belong to the ISO 8859-1 character set but fall outside the 7-bit range have been converted to the character entity references for ISO 8859-1 characters. The subset of special characters that were used can be found in ttext.dtd. An overview of the different standards for this character (sub)set is presented in entities.htm.
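
A parser that does not load ttext.dtd has no declarations for entities such as &eacute; and may reject or drop them. One possible workaround, sketched here as an assumption rather than as the corpus's own procedure, is to replace the named references with the characters they denote before parsing. The sketch uses the name2codepoint table from the html.entities module of the Python standard library and restricts it to code points up to 255 (the ISO 8859-1 range). The function name resolve_latin1_entities is hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities predefined by XML itself must stay untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def resolve_latin1_entities(text):
    """Replace named ISO 8859-1 character entity references by their characters."""
    def repl(match):
        name = match.group(1)
        code = name2codepoint.get(name)
        if name in XML_PREDEFINED or code is None or code > 0xFF:
            return match.group(0)    # leave anything else as it is
        return chr(code)
    return ENTITY_RE.sub(repl, text)

# One <tw> line from the example, parsed without the DTD.
line = '<tw ref="fn123456.4.2" tb="115.481" te="116.108" tt="in" tq="man" w="ok&eacute;"/>'
word = ET.fromstring(resolve_latin1_entities(line)).get("w")
print(word)    # oké
```

Replacing the references on the raw text keeps the example independent of how a particular parser treats undeclared entities. When ttext.dtd is available, a validating parser that reads the DTD is the more direct route.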