Orthographic transcription

All the recorded material was transcribed orthographically. The orthographic transcription is a verbatim record of what was actually said, so repetitions, hesitations, false starts and the like were transcribed as well. Background noise, on the other hand, was seldom represented in the transcriptions.

Below the role of the orthographic transcription is discussed in some detail, as are the aims that were pursued. Attention is also given to the protocol that was developed and the procedures that were followed. Information is included about the file types and formats that were used. Finally, an overview is presented of the data that are available in version 1.0.




Aim and motivation

The aim of the orthographic transcription of data in the Spoken Dutch Corpus was two-fold. First of all, it served to provide users with a simple symbolic representation of the audio file. By means of this representation it is easy to navigate through the corpus, it is possible to derive frequency information, etc. The orthographic transcription is one of the few transcriptions/annotations that are available for the entire corpus. Moreover, the transcription has been checked manually. Secondly, the orthographic transcription formed the basis for all other transcriptions and annotations.

In view of the importance of the orthographic transcription, a great deal of thought was given at the beginning of the project to what the nature of the orthographic transcription ought to be and to formulating the principles underlying the transcription process. An account of the various considerations that were weighed can be found in the protocol for orthographic transcriptions. The following principles were adopted:



Procedure

To facilitate the transcription process, the PRAAT software was used, which was developed by Paul Boersma at the University of Amsterdam. In PRAAT it is possible not only to play the recording and to visualize the signal, but also to produce and view orthographic transcriptions. For each speaker a separate tier is available.

During transcription, the speech signal was divided into short segments of approx. 3 seconds by means of markers. These markers were placed in naturally occurring pauses between words (please note that the places where these markers occur do not necessarily coincide with syntactic boundaries). Later on the markers were used as anchor points for the automatic segmentation.
 



Protocol

In view of the principles that were adopted (see above) and the time and money available, a number of criteria were established that formed the basis for the Protocol voor orthografische transcriptie (Goedertier & Goddijn 2000; available in .ps and .pdf format; Dutch only). These criteria are consistency, accuracy and transparency.

Consistency
The experience gained in a number of other projects (e.g. Switchboard, SpeechDat) is that it is advisable to maintain standard spelling conventions. This is generally easier for the transcribers, while it also contributes to the degree of consistency. Therefore, in the Spoken Dutch Corpus project standard spelling conventions were used. However, in order to further increase consistency in the transcriptions, in a number of cases it was decided to deviate from standard conventions. This is for instance the case for the use of punctuation marks and the use of capital and small letters.

In order to obtain a transcription that would be as consistent as possible, the spelling of (known) words was checked on-line during the transcription process by means of a spell checker. If an error was detected, the transcriber was supposed to correct the error or to mark the word with one of the special symbols that had been specified in the protocol. Special symbols were defined for new (i.e. as yet unknown) words, but also for incomplete words, dialect words, etc. The marked words were validated by a lexicographer and then added to the lexicon.
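The check described above can be illustrated with a minimal sketch. It is not the spell checker that was used in the project. The marker `*n`, the function name `check_tokens` and the tiny lexicon are hypothetical and serve only to show the idea: a word passes if it is in the lexicon or if the transcriber has marked it, and is flagged otherwise.

```python
# Hypothetical marker for a word flagged as new by the transcriber.
# The actual special symbols are defined in the protocol and are not
# reproduced here.
NEW_WORD_MARK = "*n"


def check_tokens(line, lexicon):
    """Return the tokens that are neither in the lexicon nor marked."""
    unknown = []
    for token in line.split():
        if "*" in token:
            # Marked by the transcriber; left for the lexicographer.
            continue
        word = token.strip(".?!,").lower()
        if word and word not in lexicon:
            unknown.append(token)
    return unknown


# Toy lexicon (assumption, for illustration only).
lexicon = {"dat", "is", "een", "mooi", "huis"}

flagged = check_tokens("dat is een mooi huiss", lexicon)
marked_ok = check_tokens("dat is een snorfiets" + NEW_WORD_MARK, lexicon)
```

In this sketch `flagged` contains the misspelled token, while the marked word in the second line is skipped and would be passed on for validation.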

Accuracy
The procedure for producing an orthographic transcription was set up so as to yield as accurate a transcription as possible. After one transcriber had made a first transcription, a second transcriber checked it. This involved checking the correctness of the transcription: was everything that was said fully and correctly represented, had the speech been attributed to the correct speaker(s), etc.

The accuracy of the orthographic transcription was subjected to further checks as the data were passed on to receive further transcriptions and annotations. Whenever an error was detected, a bug report was filed. Then the transcription was checked once more and the error corrected.

Transparency
An attempt was made to keep the number of rules in the protocol down to a minimum. This makes it easier for transcribers to memorize them and to apply them correctly. The protocol includes not only the rules for transcription, but also a great many examples. As the protocol was being developed, the experiences gained by the transcribers were also taken into account. As a result the protocol has proven to be practicable.
 




File types and formats

The orthographic transcriptions are available in two formats: the PRAAT TextGrid format and an XML format.

For a detailed description of these formats, see the descriptions of the ort format, the pri format and the skp format.

Files in the TextGrid format are of the type .ort. These files can be found in the directory /data/annot/text/ort/ of the annotation DVD.
Files in the XML format can be found in the directories /data/annot/xml/pri/ and /data/annot/xml/skp/ of the annotation DVD.
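As an illustration of how a TextGrid transcription with one tier per speaker might be read, a minimal sketch is given below. It assumes the PRAAT short text format with interval tiers only. Whether the .ort files on the DVD have exactly this layout and encoding should be checked against the description of the ort format. The speaker names and the sample text are invented.

```python
def parse_short_textgrid(text):
    """Parse a short-format TextGrid into {tier name: [(start, end, label)]}.

    Assumption: interval tiers only, one value per line, no labels
    that span several lines.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # lines[0:5]: file type, object class, xmin, xmax, <exists>
    pos = 5
    n_tiers = int(lines[pos])
    pos += 1
    tiers = {}
    for _ in range(n_tiers):
        tier_class = lines[pos].strip('"')
        if tier_class != "IntervalTier":
            raise ValueError("unsupported tier class: " + tier_class)
        name = lines[pos + 1].strip('"')
        n_intervals = int(lines[pos + 4])  # after tier xmin and xmax
        pos += 5
        intervals = []
        for _ in range(n_intervals):
            start = float(lines[pos])
            end = float(lines[pos + 1])
            # Remove the enclosing quotes; "" inside a label is an escaped ".
            label = lines[pos + 2][1:-1].replace('""', '"')
            intervals.append((start, end, label))
            pos += 3
        tiers[name] = intervals
    return tiers


# Invented sample with two speakers (hypothetical names and content).
SAMPLE = '''File type = "ooTextFile short"
"TextGrid"

0
5.2
<exists>
2
"IntervalTier"
"N00001"
0
5.2
2
0
2.9
"ja dat klopt"
2.9
5.2
""
"IntervalTier"
"N00002"
0
5.2
1
0
5.2
"uh nee"
'''

tiers = parse_short_textgrid(SAMPLE)
```

Each tier maps to a list of (start, end, text) tuples, so the markers that delimit the segments are directly available as interval boundaries.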
 



Overview of available data

In Table 1 an overview is presented of the data that are available in version 1.0. For a more detailed description of the corpus design and the motivation for this design, we refer to the corpus design and motivation.
 

Table 1. Overview of available data (VL = data originating from Flanders; NL = data originating from the Netherlands)
 
Component                                                                    Total words         VL         NL
a. Spontaneous conversations ('face-to-face')                                  2,626,172    878,383  1,747,789
b. Interviews with teachers of Dutch                                             565,433    315,554    249,879
c. Spontaneous telephone dialogues (recorded via a switchboard)                1,208,633    465,096    743,537
d. Spontaneous telephone dialogues (recorded on MD via a local interface)        853,371    343,167    510,204
e. Simulated business negotiations                                               136,461          0    136,461
f. Interviews/discussions/debates (broadcast)                                    790,269    250,708    539,561
g. (political) Discussions/debates/meetings (non-broadcast)                      360,328    138,819    221,509
h. Lessons recorded in the classroom                                             405,409    105,436    299,973
i. Live (e.g. sports) commentaries (broadcast)                                   208,399     78,022    130,377
j. News reports/reportages (broadcast)                                           186,072     95,206     90,866
k. News (broadcast)                                                              368,153     82,855    285,298
l. Commentaries/columns/reviews (broadcast)                                      145,553     65,386     80,167
m. Ceremonious speeches/sermons                                                   18,075     12,510      5,565
n. Lectures/seminars                                                             140,901     79,067     61,834
o. Read speech                                                                   903,043    351,419    551,624
Total                                                                          8,916,272  3,261,628  5,654,644
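
The figures in Table 1 can be cross-checked with a few lines of code. The sketch below copies the word counts from the table and verifies that, for each component, the VL and NL counts add up to the total, and that the columns add up to the totals in the last row. The names `TABLE_1`, `inconsistent_rows` and `column_sums` are chosen for this illustration only.

```python
# (total, VL, NL) word counts per component, copied from Table 1.
TABLE_1 = {
    "a": (2_626_172, 878_383, 1_747_789),
    "b": (565_433, 315_554, 249_879),
    "c": (1_208_633, 465_096, 743_537),
    "d": (853_371, 343_167, 510_204),
    "e": (136_461, 0, 136_461),
    "f": (790_269, 250_708, 539_561),
    "g": (360_328, 138_819, 221_509),
    "h": (405_409, 105_436, 299_973),
    "i": (208_399, 78_022, 130_377),
    "j": (186_072, 95_206, 90_866),
    "k": (368_153, 82_855, 285_298),
    "l": (145_553, 65_386, 80_167),
    "m": (18_075, 12_510, 5_565),
    "n": (140_901, 79_067, 61_834),
    "o": (903_043, 351_419, 551_624),
}

# Last row of Table 1.
REPORTED_TOTAL = (8_916_272, 3_261_628, 5_654_644)


def inconsistent_rows(table):
    """Return the components for which VL + NL differs from the total."""
    return [key for key, (total, vl, nl) in table.items() if vl + nl != total]


def column_sums(table):
    """Return the sums of the total, VL and NL columns."""
    return tuple(sum(row[i] for row in table.values()) for i in range(3))


bad_rows = inconsistent_rows(TABLE_1)   # empty list: every row is consistent
sums = column_sums(TABLE_1)             # equals REPORTED_TOTAL
```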

 
 





Word frequency lists

On the basis of the data that are available in this release various word frequency lists were compiled. The different lists are the following:

A description of the different lists can be found on ../../lexicon/freq_lst.htm. The frequency lists can be found in the directory /data/lexicon/ of the annotation DVD.
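
How a word frequency list can be derived from orthographic transcriptions is sketched below. This is an illustration only and not the procedure by which the lists in this release were compiled. The sample utterances are invented, and splitting on white space and lowercasing the tokens are simplifying assumptions.

```python
from collections import Counter


def word_frequencies(utterances):
    """Count word forms over a list of transcribed utterances."""
    counts = Counter()
    for utterance in utterances:
        # Assumption: tokens are separated by white space and case is ignored.
        counts.update(utterance.lower().split())
    return counts


# Invented sample utterances.
sample = ["ja dat is goed", "nee dat is niet goed", "ja ja"]

freq = word_frequencies(sample)
most_frequent = freq.most_common(1)
```

A real list would also have to deal with the special symbols attached to marked words, which are ignored in this sketch.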
 