The .bpt format

Files of type .bpt (broad phonetic transcription) contain a chronological representation of the word segmentation in an XML text format. The structure of this format is described by ftext.dtd, which can be found on the annotation DVD. The bpt files in the directory /data/annot/xml/bpt-auto (also on the annotation DVD) have been derived from the automatic word segmentation (file type .awd). These files also provide the durations of the individual phones. In addition, there are the bpt-fon files (in /data/annot/xml/bpt-fon on the annotation DVD), which have been derived from the manually verified word segmentation (file type .wrd).

<?xml version="1.0"?>
<!DOCTYPE ftext SYSTEM "ftext.dtd">
<ftext ref="fn123456">
  <fau ref="fn123456.1" s="N01161">
    <fw ref="fn123456.1.1"   w="ja"                fon="ja"
        left="SEP"           right="SEP"           fq="auto"
        times="67.905 67.978 68.112"/>
    <fw ref="fn123456.1.2"   w="da"                fon="dAs"
        left="SEP"           right="SHARE-W(da's)" fq="auto"
        times="68.112 68.143 68.205 68.267"/>
    <fw ref="fn123456.1.3"   w="'s"                fon="dAs"
        left="SHARE-W(da's)" right="SEP"           fq="auto"
        times="68.112 68.143 68.205 68.267"/>
    <fw ref="fn123456.1.4"   w="waar"              fon="war"
        left="SEP"           right="SEP"           fq="auto"
        times="68.267 68.319 68.371 68.423"/>
    <fl ref="fn123456.1.5"   w="."/>
  </fau>
  <fau ref="fn123456.2" s="N01169">
    <fw ref="fn123456.2.1"   w="en"                fon="En"
        left="SEP"           right="SEP"           fq="auto_unrel"
        times="69.040 71.868"/>
    <fw ref="fn123456.2.2"   w="hij"               fon="hE+"
        left="SEP"           right="SEP"           fq="auto"
        times="69.040 71.868"/>
    <fl ref="fn123456.2.3"   w="?"/>
  </fau>
  <fau ref="fn123456.3" s="N01167">
    <fw ref="fn123456.3.1"   w="en"                fon="En"
        left="SEP"           right="SEP"           fq="auto"
        times="87.043 87.073 87.124"/>
    <fw ref="fn123456.3.2"   w="ik"                fon="Ik"
        left="SEP"           right="SHARE-P(k)"    fq="auto"
        times="87.124 87.205 87.265"/>
    <fw ref="fn123456.3.3"   w="kan"               fon="kAn"
        left="SHARE-P(k)"    right="SHARE-NP(n)"   fq="auto"
        times="87.205 87.265 87.296 87.321"/>
    <fw ref="fn123456.3.4"   w="nog"               fon="nOx"
        left="SHARE-NP(n)"   right="SEP"           fq="auto"
        times="87.321 87.346 87.397 87.427"/>
    <fw ref="fn123456.3.5"   w="wel"               fon="wEl"
        left="SEP"           right="SEP"           fq="auto"
        times="87.427 87.457 87.487 87.528"/>
    <fl ref="fn123456.3.6"   w="."/>
  </fau>
  <fau ref="fn123456.4" s="N09099">
    <fw ref="fn123456.4.1"   w="netto"               fon="nEtow"
        left="SEP"           right="INSERT(w)"       fq="auto"
        times="328.409 328.500 328.530 328.640 328.680 328.700"/>
    <fw ref="fn123456.4.2"   w="is"                  fon="wIz"
        left="INSERT(w)"     right="SEP"             fq="auto"
        times="328.700 328.720 328.760 329.084"/>
    <fw ref="fn123456.4.3"   w="bruto"               fon="brYto"
        left="SEP"           right="SEP"             fq="auto"
        times="329.084 329.510 329.540 329.570 329.670 329.698"/>
    <fw ref="fn123456.4.4"   w="hè"                  fon="I"
        left="SEP"           right="SEP"             fq="auto_unrel"
        times="329.698 329.728"/>
    <fl ref="fn123456.4.5" w="?"/>
  </fau>
</ftext>
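As an illustration of how such a file can be read, the following is a minimal sketch in Python using the standard xml.etree.ElementTree module. The function name read_words and the shortened SAMPLE string are hypothetical and serve only this example. The sketch assumes that the file contains no character entity references, so that the external ftext.dtd is not needed for parsing.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample above. Assumption: no entity references
# occur, so the parser never needs to consult the external ftext.dtd.
SAMPLE = """<?xml version="1.0"?>
<!DOCTYPE ftext SYSTEM "ftext.dtd">
<ftext ref="fn123456">
  <fau ref="fn123456.1" s="N01161">
    <fw ref="fn123456.1.1" w="ja" fon="ja" left="SEP" right="SEP" fq="auto"
        times="67.905 67.978 68.112"/>
    <fw ref="fn123456.1.4" w="waar" fon="war" left="SEP" right="SEP" fq="auto"
        times="68.267 68.319 68.371 68.423"/>
    <fl ref="fn123456.1.5" w="."/>
  </fau>
</ftext>"""


def read_words(xml_text):
    """Return (speaker, word, transcription, times) for every <fw> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for fau in root.iter("fau"):
        for fw in fau.iter("fw"):
            # The times attribute is a space-separated list of seconds.
            times = [float(t) for t in fw.get("times", "").split()]
            rows.append((fau.get("s"), fw.get("w"), fw.get("fon"), times))
    return rows


words = read_words(SAMPLE)
for speaker, word, fon, times in words:
    # Print the speaker, the word, its transcription and its start/end time.
    print(speaker, word, fon, times[0], times[-1])
```

Punctuation marks (<fl>) are skipped here because they carry no transcription or time stamps.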

<ftext>

text with a broad phonetic transcription, word segmentation and phone segmentation

<fau>

an annotation unit. The boundaries of this element are determined by the punctuation mark.

<fw>

a word within the annotation unit (<fau>).

<fmu>

a mark-up unit that may comprise COMMENT or BACKGROUND information.

<fm>

a marker within the mark-up unit (<fmu>).

<fl>

a punctuation mark within the annotation unit (<fau>).

ref

The reference code consists of one, two or three parts (depending on the element it is associated with), separated by full stops. The parts have the following meaning:
<sample number>.<f[am]u rank number>.<f[wm] rank number>
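
The structure of the reference code can be illustrated with a small Python sketch. The helper name split_ref is hypothetical, and the sketch assumes that the sample number itself never contains a full stop and that both rank numbers are plain integers, as in the sample file.

```python
def split_ref(ref):
    """Split a ref code into (sample, unit_rank, item_rank).

    Parts that are absent (for example on <ftext>, which only carries
    the sample number) are returned as None.
    """
    parts = ref.split(".")
    if not 1 <= len(parts) <= 3:
        raise ValueError("unexpected ref code: %r" % ref)
    sample = parts[0]
    ranks = [int(p) for p in parts[1:]]
    # Pad with None so that the result always has three fields.
    ranks += [None] * (2 - len(ranks))
    return sample, ranks[0], ranks[1]


print(split_ref("fn123456"))
print(split_ref("fn123456.3"))
print(split_ref("fn123456.3.4"))
```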

s

speaker ID. In the context of the <fau> element, possible values of this attribute are: "Nxxxxx", "Vxxxxx" or "UNKNOWN", where x denotes a digit. In the context of the <fmu> element, the s attribute can have one of two values: "COMMENT" or "BACKGROUND".

w

the orthographic transcription of the word in the context of <fw>, or a punctuation mark (".", "..." or "?") in the context of <fl>.

fon

the phonetic transcription of the word. Apart from the symbols from the phonetic symbol set (see the description of the .fon format) the percentage sign '%' is used to indicate a word internal pause.

left/right

the nature of the left/right boundary of the word. There are five possible values for this attribute:

SEP	:	normally separated
SHARE-P(x)	:	shared plosive x
SHARE-NP(x)	:	shared non-plosive x
INSERT(x)	:	inserted phoneme x
SHARE-W(x)	:	shared full word (in CGN version 1.0 only "da's")
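
The five boundary values listed above can be taken apart with a regular expression, as in the following Python sketch. The names BOUNDARY_RE and parse_boundary are hypothetical. The sketch assumes that the argument between the parentheses is present for every value except SEP, as in the sample file.

```python
import re

# One alternative per boundary type, optionally followed by "(argument)".
BOUNDARY_RE = re.compile(
    r"^(SEP|SHARE-P|SHARE-NP|INSERT|SHARE-W)(?:\((.*)\))?$"
)


def parse_boundary(value):
    """Return (kind, argument); the argument is None for SEP."""
    match = BOUNDARY_RE.match(value)
    if match is None:
        raise ValueError("unknown boundary value: %r" % value)
    kind, argument = match.group(1), match.group(2)
    # SEP takes no argument; all other kinds are assumed to require one.
    if (kind == "SEP") != (argument is None):
        raise ValueError("malformed boundary value: %r" % value)
    return kind, argument


print(parse_boundary("SEP"))
print(parse_boundary("SHARE-NP(n)"))
print(parse_boundary("SHARE-W(da's)"))
```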

marked

translates the * coding in the original orthographic transcription (.ort format) into an optional attribute of the <fw> element. Possible values are: foreign, dialect, incomplete, mispr, regionalpr and uncertain.

fq

the quality of the time interval. This attribute has one of the following three values:
"man" (manually verified): time markers that have been inserted by a human.
"auto" (automatically generated): time markers that have been generated by a machine and have not been validated.
"auto_unrel" (automatically generated, unreliable): time markers that have been generated by a machine and are known to be unreliable.

times

comprises the time stamps of the phone boundaries. The attribute always contains N+1 timestamps where N = number of phonemes + any word internal pauses ('%'). The first time stamp indicates the beginning of the first phoneme, the second one the beginning of the second phoneme, etc. The final time stamp indicates the end of the last phoneme.
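
Since consecutive time stamps delimit one phone (or word internal pause) each, the durations follow from the differences between neighbouring values. The following Python sketch shows this for the word "kan" from the sample. The function name phone_durations is hypothetical. Decimal is used instead of float so that the three-digit values are subtracted exactly.

```python
from decimal import Decimal


def phone_durations(times):
    """Return the duration in seconds of each segment in a times attribute."""
    stamps = [Decimal(t) for t in times.split()]
    if len(stamps) < 2:
        raise ValueError("at least two time stamps are required")
    # N+1 time stamps delimit N segments: pair each stamp with its successor.
    return [end - start for start, end in zip(stamps, stamps[1:])]


durations = phone_durations("87.205 87.265 87.296 87.321")
for duration in durations:
    print(duration)
```

The total duration of the word is the difference between the last and the first time stamp. Note that, in the sample, words joined by a shared phone have overlapping time intervals.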

All characters from the ISO 8859-1 character set that were used in the transcription and fall outside the 7-bit range have been replaced by the corresponding character entity references for ISO 8859-1 characters. The set of special characters used is listed in ttext.dtd, which can be found on the annotation DVD. The file entities.htm gives an overview of the various standards for this character (sub)set.
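
To show how such an entity reference maps back to a character, the following Python sketch uses the standard html module, which knows the customary ISO 8859-1 entity names. It rests on the assumption, not verified against ttext.dtd, that the DTD uses these same names (for example &egrave; for è). The function name decode_entities is hypothetical.

```python
import html
from html.entities import name2codepoint


def decode_entities(value):
    """Replace character entity references in an attribute value."""
    # Apply this to individual values only, not to a whole XML file:
    # it would also decode &amp; and &lt; and so damage the markup.
    return html.unescape(value)


decoded = decode_entities("h&egrave;")
print(decoded)
# Code point of the entity named "egrave" in ISO 8859-1 (and Unicode).
print(name2codepoint["egrave"])
```

A validating XML parser that reads the DTD resolves these references by itself; the sketch is only relevant when the values are handled without the DTD.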