Word segmentation

The entire corpus was automatically segmented at the word level, that is, each word in the corpus is aligned with the speech signal by means of points in time. Where the segmentation was based on an automatic phonetic transcription, it is also available at the phoneme level. The part of the corpus for which a manually verified broad phonetic transcription was available was also segmented at the word level, and this segmentation was manually verified. For this part, the phoneme segmentations are not available.

Below, information is given on the goals, the protocol, the procedure, and the file types and formats. Finally, an overview is presented of the data that are available in version 1.0 of the corpus.

Aim and motivation

The main goal of this annotation layer is to separate words acoustically by means of marks in the speech signal. These marks must be set in such a way that the stretch of speech they delimit contains the word and nothing more than this word. Ideally, the separated words are recognisable as such and sound acoustically acceptable.

The word segmentation is useful for quick access to the data, for instance when one wants to hear the acoustic realisation of a certain word. In addition, the manually checked word segmentation can be considered a reliable source for training an automatic speech recognizer, for which an initial segmentation then already exists. Finally, the word segmentation establishes a one-to-one link between each orthographic word and its phonetic counterpart. The link is established in terms of markings in the speech signal. For the part of the corpus that was enriched with a manual verification of the word segmentation, a manual broad phonetic transcription was available as well.
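
As an illustration of the quick access mentioned above, the sketch below converts the time marks of a word into sample indices and cuts the corresponding stretch from a signal. The function name `cut_word`, the sampling rate of 16 kHz and the dummy signal are assumptions made for this example; they are not taken from the corpus documentation.

```python
from typing import List, Sequence


def cut_word(samples: Sequence[int], rate: int, start: float, end: float) -> List[int]:
    """Return the samples that lie between two time marks (in seconds)."""
    if not 0 <= start <= end:
        raise ValueError("time marks must satisfy 0 <= start <= end")
    first = round(start * rate)                      # index of the first sample
    last = min(round(end * rate), len(samples))      # index just past the last sample
    return list(samples[first:last])


rate = 16000                 # assumed sampling rate in Hz
signal = list(range(rate))   # one second of dummy samples
word = cut_word(signal, rate, start=0.25, end=0.5)
print(len(word))
```

In practice the samples would come from an audio file and the time marks from a word segmentation file; both are left out here because their exact formats are described elsewhere.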



Procedure

For each phoneme in the broad phonetic transcription, whether created manually or automatically, an automatic speech recognizer identifies the interval in the speech signal that corresponds to that phoneme. The word segmentations were derived from these phoneme segments. More information about the procedure can be found in Martens et al. (2002).
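
The step from phoneme segments to word segments can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Martens et al. (2002): the `Segment` type, the function `words_from_phonemes`, the invented example times, and the assumption that the number of phonemes per word is known from the transcription are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    label: str    # phoneme symbol or orthographic word
    start: float  # start time in seconds
    end: float    # end time in seconds


def words_from_phonemes(
    phonemes: List[Segment], words: List[Tuple[str, int]]
) -> List[Segment]:
    """Collapse consecutive phoneme segments into word segments.

    `words` pairs each orthographic word with the number of phonemes
    in its transcription (assumed to be at least one).
    """
    result = []
    pos = 0
    for word, n in words:
        if n < 1 or pos + n > len(phonemes):
            raise ValueError(f"cannot align word {word!r}")
        chunk = phonemes[pos:pos + n]
        # A word starts where its first phoneme starts
        # and ends where its last phoneme ends.
        result.append(Segment(word, chunk[0].start, chunk[-1].end))
        pos += n
    if pos != len(phonemes):
        raise ValueError("unaligned phonemes left over")
    return result


# Hypothetical example: "de kat" with invented times.
phonemes = [
    Segment("d", 0.10, 0.16),
    Segment("@", 0.16, 0.22),
    Segment("k", 0.22, 0.30),
    Segment("A", 0.30, 0.41),
    Segment("t", 0.41, 0.50),
]
word_segments = words_from_phonemes(phonemes, [("de", 2), ("kat", 3)])
for seg in word_segments:
    print(seg.label, seg.start, seg.end)
```

The sketch assumes that every phoneme belongs to exactly one word. Phonemes shared between two adjacent words, which the protocol discusses, are not handled here.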

Only when the phonetic transcriptions were created automatically (cf. Demuynck et al. 2002 and Cucchiarini et al. 2001) are the original phoneme segmentations also available. A part of the data received a manual phonetic transcription (see here). It is only for these data that manually verified word segmentations are available.

PRAAT was used for the manual verification of the word segmentation. PRAAT lets you see and play the speech signal together with the transcription tiers, in which, in this case, the words are displayed separated by markers. Where necessary, these markers can easily be dragged to the right position with the mouse.
 

References:

Martens, J.P., D. Binnenpoorte, K. Demuynck, R. van Parys, T. Laureys, W. Goedertier & J. Duchateau. 2002. Word Segmentation in the Spoken Dutch Corpus. In Proceedings of LREC2002. Las Palmas de Gran Canaria, Spain.

Demuynck, K., T. Laureys & S. Gillis. 2002. Automatic Generation of Phonetic Transcriptions for Large Speech Corpora. In Proceedings International Conference on Spoken Language Processing. Vol. 1: 333-336. Denver, USA.

Cucchiarini, C., D. Binnenpoorte & S. Goddijn. 2001. Phonetic Transcriptions in the Spoken Dutch Corpus: how to combine efficiency and good transcription quality. In Proceedings Eurospeech 2001. Aalborg, Denmark. pp. 1679-1682.


Protocol

A protocol (Binnenpoorte, 2002) was written to ensure that the manual verification of the word segmentation was carried out as consistently as possible. To this end, several guidelines were formulated. The most important of these were:

The speech data in the corpus are continuous speech, meaning that words are not separated from each other by pauses, unlike words in written text, which are separated by spaces. The continuous stream of sounds sometimes causes problems when trying to separate words. This happens when two words share phonemes at the end of the first word and the beginning of the second word. How to handle this and other problems is discussed extensively in the protocol.
 

Binnenpoorte, D. 2002. Protocol voor manuele verificatie van automatisch gegenereerde woordsegmentaties. (Available here in .ps and .pdf format; Dutch only.)


File types and formats

The word segmentation files are stored in the following way:
 


An extensive description of these file formats can be found under the wrd format, the awd format, the bpt format and the skp format.
 



Overview of available data

Table 1 presents an overview of the data that are available in version 1.0. For a more detailed description of the corpus design and the motivation for this design, see the corpus design and motivation.
 

Table 1. Overview of available data (VL = data originating from Flanders; NL = data originating from the Netherlands)
 
Component                                                                  Total number of words         VL         NL
a. Spontaneous conversations ('face-to-face')                                            177,127     70,945    106,182
b. Interviews with teachers of Dutch                                                      59,751     34,064     25,687
c. Spontaneous telephone dialogues (recorded via a switchboard)                          270,027     68,886    201,141
d. Spontaneous telephone dialogues (recorded on MD via a local interface)                  6,257      6,257          0
e. Simulated business negotiations                                                        25,485          0     25,485
f. Interviews/discussions/debates (broadcast)                                            100,250     25,144     75,106
g. (political) Discussions/debates/meetings (non-broadcast)                               34,126      9,009     25,117
h. Lessons recorded in the classroom                                                      36,064     10,103     25,961
i. Live (eg sports) commentaries (broadcast)                                              35,116     10,130     24,986
j. Newsreports/reportages (broadcast)                                                     32,744      7,679     25,065
k. News (broadcast)                                                                       32,601      7,305     25,296
l. Commentaries/columns/reviews (broadcast)                                               32,502      7,431     25,071
m. Ceremonious speeches/sermons                                                            7,077      1,893      5,184
n. Lectures/seminars                                                                      23,056      8,143     14,913
o. Read speech                                                                           135,071     64,848     70,223
Total                                                                                  1,007,254    331,837    675,417

An automatic word segmentation, including the phoneme segmentation, is also available for all data in the corpus. Information about the amount of data and their characteristics can be found in the table under orthographic transcription.
 
 
