Broad phonetic transcription

Part of the data in the corpus has been enriched with a manually verified broad phonetic transcription. The manual work consisted of verifying and correcting an automatically generated phonetic transcription. The transcriptions are broad in the sense that the predefined phoneme set contains no allophonic variants or diacritics.

Below you will find information about the aims, the transcription procedure that was adopted, and the protocol that was used. We also refer to the relevant file types and formats. Finally, an overview is presented of the data that are available in version 1.0.


Aim and motivation

The main goal was to obtain verified broad phonetic transcriptions of the material by applying insertions, deletions and substitutions to the automatically generated transcription. Gradual processes, such as voicing and devoicing of plosives and fricatives or diphthongization and monophthongization of vowels, are not transcribed in this broad phonetic transcription.

The phoneme set that was used is described here (in a .ps file or a .pdf file).
 



Procedure

Starting from an automatically generated transcription not only made the transcription procedure more time-efficient, but also increased the consistency between the transcribers. The human transcribers’ task was to listen to the speech signal and decide for each symbol in the transcription whether it should be deleted or substituted by another phoneme, and whether one or more phonemes were missing from the given transcription.
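
The three kinds of correction can be made concrete by aligning a given transcription with its corrected version. The following is a minimal sketch, assuming both transcriptions are available as lists of phoneme symbols; the function name `edit_operations` and the example symbols are illustrative and are not taken from the corpus or its tools. It uses a standard minimum-edit-distance alignment, which is one common way to count such corrections and not necessarily the method used in the project.

```python
def edit_operations(auto, verified):
    """Count the substitutions, deletions and insertions in one minimal
    edit script that turns the `auto` symbol list into `verified`."""
    n, m = len(auto), len(verified)
    # dist[i][j] = minimal number of edits turning auto[:i] into verified[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if auto[i - 1] == verified[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion from auto
                             dist[i][j - 1] + 1)         # insertion into auto

    # Trace back through the table to label each edit.
    counts = {"substitution": 0, "deletion": 0, "insertion": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if auto[i - 1] == verified[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                counts["substitution"] += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletion"] += 1
            i -= 1
        else:
            counts["insertion"] += 1
            j -= 1
    return counts


# Invented example: the final symbol of the given transcription was not heard.
given = ["l", "o", "p", "@", "n"]
corrected = ["l", "o", "p", "@"]
print(edit_operations(given, corrected))
```

When several minimal edit scripts exist, the split between the three categories can differ from the choices a human transcriber actually made, so such counts are only an approximation of the manual corrections.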

The PRAAT software was used to create the manual broad phonetic transcription. One advantage of PRAAT is that it can display both an oscillogram of the speech signal and the accompanying transcription, and can replay the speech signal when required. It was decided to display only the given phonetic transcription, without the original orthographic transcription. The more conversational and extemporaneous types of speech, which are known to be more difficult to transcribe, were transcribed in two rounds to ensure that the transcription quality would meet a certain level: the transcription produced by one human transcriber was submitted to a second transcriber, who was asked to verify and correct it.

More information about the procedure and the transcription quality of the final result can be found in Goddijn and Binnenpoorte (2003).

Reference:

S. Goddijn & D. Binnenpoorte, ‘Assessing Manually Corrected Broad Phonetic Transcriptions in the Spoken Dutch Corpus’, in Proceedings of 15th ICPhS, Barcelona, Spain, pp. 1361-1364, 2003.
 


Protocol

The transcription rules are laid down in a protocol (Gillis, 2001) in order to improve consistency between the transcribers. The protocol describes the phoneme set and additional symbols, and gives many examples of how to use the phonemes. One of the main guidelines was not to rely too heavily on the given transcription, but to decide on a phoneme on the basis of one's own perception. Only in case of doubt could the original symbol be retained in the transcription.

Reference:

Gillis, S. 2001. Protocol voor brede fonetische transcriptie. (Available here in .ps and .pdf format; Dutch only.)
 
 


File types and formats

The broad phonetic transcriptions are stored in the following types of files:
 

 



Overview of available data

Table 1 gives an overview of the data that are available in version 1.0. For a more detailed description of the corpus design and the motivation for this design, we refer to the documentation on corpus design and motivation.
 

Table 1. Overview of available data (VL = data originating from Flanders; NL = data originating from the Netherlands)

| Component | Total number of words | VL | NL |
|---|---|---|---|
| a. Spontaneous conversations ('face-to-face') | 177,127 | 70,945 | 106,182 |
| b. Interviews with teachers of Dutch | 59,751 | 34,064 | 25,687 |
| c. Spontaneous telephone dialogues (recorded via a switchboard) | 270,027 | 68,886 | 201,141 |
| d. Spontaneous telephone dialogues (recorded on MD via a local interface) | 6,257 | 6,257 | 0 |
| e. Simulated business negotiations | 25,485 | 0 | 25,485 |
| f. Interviews/discussions/debates (broadcast) | 100,250 | 25,144 | 75,106 |
| g. (political) Discussions/debates/meetings (non-broadcast) | 34,126 | 9,009 | 25,117 |
| h. Lessons recorded in the classroom | 36,064 | 10,103 | 25,961 |
| i. Live (eg sports) commentaries (broadcast) | 35,116 | 10,130 | 24,986 |
| j. Newsreports/reportages (broadcast) | 32,744 | 7,679 | 25,065 |
| k. News (broadcast) | 32,601 | 7,305 | 25,296 |
| l. Commentaries/columns/reviews (broadcast) | 32,502 | 7,431 | 25,071 |
| m. Ceremonious speeches/sermons | 7,077 | 1,893 | 5,184 |
| n. Lectures/seminars | 23,056 | 8,143 | 14,913 |
| o. Read speech | 135,071 | 64,848 | 70,223 |
| Total | 1,007,254 | 331,837 | 675,417 |
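
The figures in Table 1 can be checked for internal consistency with a short script. The sketch below copies the word counts from the table; the names `ROWS`, `TOTAL`, `check_table` and `regional_share` are introduced here for illustration only. It verifies that VL and NL add up to the component total in every row, that the columns add up to the Total row, and computes the share of each region in the transcribed material.

```python
# (component, total words, VL words, NL words), copied from Table 1
ROWS = [
    ("a", 177127, 70945, 106182),
    ("b", 59751, 34064, 25687),
    ("c", 270027, 68886, 201141),
    ("d", 6257, 6257, 0),
    ("e", 25485, 0, 25485),
    ("f", 100250, 25144, 75106),
    ("g", 34126, 9009, 25117),
    ("h", 36064, 10103, 25961),
    ("i", 35116, 10130, 24986),
    ("j", 32744, 7679, 25065),
    ("k", 32601, 7305, 25296),
    ("l", 32502, 7431, 25071),
    ("m", 7077, 1893, 5184),
    ("n", 23056, 8143, 14913),
    ("o", 135071, 64848, 70223),
]
TOTAL = (1007254, 331837, 675417)  # the Total row: words, VL, NL


def check_table(rows, total):
    """Return True if every row and every column of the table adds up."""
    rows_ok = all(vl + nl == words for _, words, vl, nl in rows)
    column_sums = tuple(sum(row[k] for row in rows) for k in (1, 2, 3))
    return rows_ok and column_sums == total


def regional_share(total):
    """Return the fractions of all words that come from VL and from NL."""
    words, vl, nl = total
    return vl / words, nl / words


print(check_table(ROWS, TOTAL))
vl_share, nl_share = regional_share(TOTAL)
print(f"VL: {vl_share:.1%}, NL: {nl_share:.1%}")
```

With the figures as given, the check succeeds, and roughly one third of the transcribed words originate from Flanders.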



Frequency information

A frequency list was derived from the manually verified data available in version 1.0. For each orthographic form in this part of the corpus, the list gives the frequency with which each phonetic transcription occurs. A description can be found in ../../lexicon/freq_lst.htm. The frequency list itself (fonalph.frq) can be found in the directory /data/lexicon/freqlists/ of the annotation DVD.
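
To illustrate what such a list contains, the sketch below counts how often each phonetic transcription occurs for a given orthographic form. It assumes the data are available as (orthographic, phonetic) pairs, one per word token; the function name `pronunciation_frequencies` and the example tokens are invented for illustration. It does not read or reproduce the actual format of fonalph.frq, which is documented in freq_lst.htm.

```python
from collections import Counter, defaultdict


def pronunciation_frequencies(pairs):
    """Map each orthographic form to a Counter of its phonetic transcriptions."""
    table = defaultdict(Counter)
    for orthographic, phonetic in pairs:
        table[orthographic][phonetic] += 1
    return table


# Invented tokens, not corpus data: (orthographic form, phonetic transcription)
tokens = [("het", "@t"), ("het", "hEt"), ("het", "@t"), ("een", "@n")]

freqs = pronunciation_frequencies(tokens)
for word in sorted(freqs):
    # most_common() lists the most frequent transcription first
    for phonetic, count in freqs[word].most_common():
        print(word, phonetic, count)
```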
 
 
