The Spoken Dutch Corpus project

The Spoken Dutch Corpus project was aimed at the construction of a database of contemporary standard Dutch as spoken by adults in the Netherlands and Flanders. The intended size of the corpus was ten million words (about 1,000 hours of speech), two thirds of which would originate from the Netherlands and one third from Flanders. Version 1.0 presents the results that have emerged from the project. The total number of words available here is nearly 9 million (800 hours of speech). Some 3.3 million words were collected in Flanders and well over 5.6 million in the Netherlands.

The corpus comprises a large number of samples of (recorded) spoken text. The entire corpus has been transcribed orthographically, while the transcripts have been linked to the speech files. The orthographic transcription was used as the starting-point for the lemmatization and part-of-speech tagging of the corpus. For a selection of one million words, a (verified) broad phonetic transcription has been produced, while for this part of the corpus also the alignment of the transcripts and the speech files has been verified at the word level. In addition, a selection of one million words has been annotated syntactically. Finally, for a more modest part of the corpus, approximately 250,000 words, a prosodic annotation is available.

Parts of the corpus have already been made available in the course of the project through a number of intermediate releases. The present release replaces all earlier releases.

The project was funded by the Flemish and Dutch governments and the Netherlands Organization for Scientific Research (NWO). The total budget was some 4.9 million euros. The Dutch Language Union (Nederlandse Taalunie) holds all rights. It is therefore not permitted to reproduce or make public part(s) of the data in any fashion without prior written permission from the Dutch Language Union.

The corpus may be used for scientific research and the development of non-commercial products. In these products the contributions made by individual speakers may not be present in such a fashion that these can be identified. With a commercial licence the corpus may be used for the development of commercial derivative products such as speech recognizers and language models. In these products the contributions made by individual speakers may not be present in such a fashion that these can be identified.

Read more about

the background to and motivation for the project
the project organisation
the work packages (corpus design, recording and digitization, orthographic transcription, lemmatisation, POS tagging, lexicon link-up, broad phonetic transcription, word segmentation, syntactic annotation, prosodic annotation, the COREX corpus exploitation software)
the time table
the dissemination of the results
the results of the mid-term evaluation
the publications that have emerged from the project

Finally, by way of illustration three short samples have been included here that form part of the corpus. See under demo.

Background and motivation

Standard Dutch is the official language in the Netherlands (some 15 million people speak northern standard Dutch), in Flanders (the northern part of Belgium, where about 5.6 million people speak southern standard Dutch), in Surinam (some 360,000 speakers, about 50 per cent of whom live in the Netherlands) and the Dutch Antilles (some 240,000 speakers). Although northern and southern standard Dutch are variants of the same language, there are considerable differences between them. These differences occur with regard to syntax, morphology, lexis and phonetics/phonology.

As one of the smaller languages in Europe, Dutch is under serious threat of gradually disappearing as it is losing ground to English. The availability of the necessary resources has placed the English language and speech technology in the leading position it holds today and has further strengthened the position of English for business communication. The fact that for Dutch few relevant resources were available has formed a serious complication for the advancement of Dutch language and speech technology. The Spoken Dutch Corpus project sought to ameliorate this situation.

Apart from the interests held by language and speech technologists, the corpus is intended to serve several other research interests. For one, the corpus is of importance to linguists from various backgrounds. So far only written text corpora were available for Dutch. As a consequence, Dutch descriptive linguistics has focused on the written language, while there is as yet hardly any systematic knowledge of the much more elusive spoken form of the language. A corpus of spoken Dutch is also relevant for teaching. A thorough knowledge of everyday language use is essential to the development of course materials for the teaching of Dutch as a second language as well as for teaching Dutch in primary and secondary schools.

Project organisation

The Spoken Dutch Corpus project was directed by a board whose members included representatives of the two governments, the Dutch Language Union, the Dutch and Flemish research foundations, and the Nijmegen Max Planck Institute (MPI). At the start of the project, Prof. W.J.M. Levelt was Chairman of the board. When he stepped down, he was succeeded by Prof. S. Nooteboom of the Landelijke Onderzoekschool Taalkunde (LOT), while Prof. W. Vonk of the MPI (also KUN) became a board member.

The board appointed a steering committee, consisting of experts from various linguistic (sub)disciplines and of language and speech technologists, which was responsible for the project's progress and finances.

The project was coordinated at two national coordinating sites: Ghent for Flanders and Nijmegen for the Netherlands. Each site was directed by a project manager. The project managers, in collaboration with three specialist working groups (one for corpus design, one for signal processing, and one for corpus annotation), were responsible for the design and implementation of the various work packages. Thus the working group for corpus design was responsible for the design and composition of the corpus, speaker recruitment and the acquisition of recordings. The working group for signal processing was in charge of the development of the protocol for orthographic transcription, and the development and implementation of the procedures that were used. In a similar fashion, they were involved in word segmentation, phonetic transcription and prosodic annotation. Finally, the working group for corpus annotation was responsible for POS tagging, lemmatisation, lexicon link-up and syntactic annotation, while the development of the Spoken Dutch Corpus lexicon also resided under this working group.

Apart from the aforementioned working groups, a number of expert committees were appointed. These had an advisory role with regard to the following matters: the development of corpus exploitation software, evaluation, and prosodic annotation.

The project's secretariat provided general organizational and administrative services.

A user group was set up whose principal role was to monitor and critically assess the design and implementation of procedures and protocols and to evaluate (intermediate) results.

Work packages

The project aimed to compile a ten-million-word corpus that would constitute a plausible sample of contemporary standard Dutch as spoken in Flanders and the Netherlands. One third of the data was to be collected in Flanders, while two thirds were to originate from the Netherlands. The entire corpus was to be orthographically transcribed and tagged for part-of-speech information. In addition, for a selection of one million words more advanced transcriptions and annotations were envisaged.

The following work packages were distinguished:

Corpus design and compilation

More information on the design of the corpus and its motivation can be found here.

Recording and digitization

Recordings were made by people working for the project or - as for example in the case of spontaneously spoken dialogues - by volunteers who had kindly agreed to record conversations conducted in their homes. Recordings were also obtained through collaboration with other projects, companies, organizations, and institutions. All recordings were digitized. With the exception of the telephone conversations, all material was stored in an uncompressed 16 bit, 16 kHz wav format (for more information on the wav format, see here). Information about the recording conditions, the equipment used, etc. is available as part of the meta-data.

It is possible to play the audio files in wav format by means of the PRAAT or the COREX software. Alternatively, most other audio players, both on pc and other platforms, can handle the wav files. Both PRAAT and COREX make it possible for users to play the recordings and at the same time view the orthographic transcript.
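
As a rough illustration of the storage format described above, the following Python sketch uses the standard-library wave module to read the basic parameters of a wav file and to estimate how much audio data a recording of a given length occupies. The function names and default values are illustrative assumptions and are not part of PRAAT, COREX or the corpus itself.

```python
import wave


def expected_pcm_bytes(duration_s, sample_rate=16000, sample_width_bytes=2, channels=1):
    """Estimate the size of the raw audio data for uncompressed PCM.

    The defaults assume 16 kHz, 16-bit (2-byte), mono audio.
    The wav header is not included in the estimate.
    """
    return round(duration_s * sample_rate * sample_width_bytes * channels)


def describe_wav(path):
    """Return (channels, bits per sample, sample rate, duration in seconds)."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        bits = wav.getsampwidth() * 8
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return channels, bits, rate, duration
```

Under these assumptions, expected_pcm_bytes(22.84) gives 730,880 bytes, or about 714 kB when 1 kB is counted as 1,024 bytes. This is in line with the file size listed for the 22.84-second sample in the demo section.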

Orthographic transcription

A verbatim transcript was made of all recordings. The transcripts conform to a large extent to standard spelling conventions. A protocol (Goedertier & Goddijn, 2000; here available in .ps and .pdf format; Dutch only) has been developed which describes in detail what to transcribe and how to deal with new words, dialect, mispronunciations, and so on. Background noises are not represented in the transcript.

To facilitate the transcription process, use was made of the interactive signal processing tool PRAAT that was developed by Paul Boersma at the University of Amsterdam. In PRAAT it is possible to listen to and visualize the speech signal and at the same time create and view the orthographic transcript. Each speaker was assigned his or her own tier.

During the transcription process, transcribers segmented the audio files into relatively short chunks (of approx. 2 to 3 seconds each) by inserting time markers in unfilled pauses between words. At a later stage these markers were used as anchor points for the automatic alignment of the transcript and the speech file.

For more information about the orthographic transcription, see ../version_1.0/annot/orthography/info.htm
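
The arrangement described in this section, with one tier per speaker and time markers bounding short chunks, can be pictured with a small data structure. The sketch below is only an illustration in Python: the Chunk class, the interleave function, the speaker labels and the times are hypothetical and do not reflect the file formats used by PRAAT or by the corpus.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    speaker: str   # hypothetical speaker label
    start: float   # time marker at the start of the chunk, in seconds
    end: float     # time marker at the end of the chunk, in seconds
    text: str      # orthographic transcript of the chunk


def interleave(tiers):
    """Merge per-speaker tiers into one list ordered by start time."""
    chunks = [chunk for tier in tiers.values() for chunk in tier]
    return sorted(chunks, key=lambda chunk: (chunk.start, chunk.end))


# Two made-up tiers with placeholder text, one per speaker.
tiers = {
    "speaker_A": [Chunk("speaker_A", 0.0, 2.1, "first chunk"),
                  Chunk("speaker_A", 4.0, 6.5, "third chunk")],
    "speaker_B": [Chunk("speaker_B", 2.1, 4.0, "second chunk")],
}

for chunk in interleave(tiers):
    print(f"{chunk.start:6.2f} {chunk.end:6.2f} {chunk.speaker}: {chunk.text}")
```

Keeping each speaker on a separate tier means that overlapping speech can be stored without conflict, while a chronological view can still be derived when needed.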

Lemmatization and POS tagging

The entire corpus has been tagged for part-of-speech information. Within the project a tagset was defined which consists of 316 tags. The tagset closely follows the Algemene Nederlandse Spraakkunst (ANS, the authoritative Dutch reference grammar; Haeseryn et al., 1997). The tagset conforms to the EAGLES guidelines. A description can be found in Van Eynde (2003; here available in .pdf format; Dutch only).
Tagging was done by means of the POS tagger that was developed at the University of Tilburg. The tagger was used to assign the most likely tag to each word in the text. The output of the tagger was then checked manually and corrected where necessary. In addition to the POS tag, the associated lemma is given for each word. A lemmatizer was used to associate the appropriate lemma with each token automatically. The result was manually checked and corrected.

For more information about the POS tagging, see ../version_1.0/annot/pos tagging/info.htm
For more information about the lemmatisation, see ../version_1.0/annot/lemmatisation/info.htm

Lexicon link-up

Within the project a lexicon has been developed. The lexicon has played an important role in the transcription and annotation of the corpus, while it also constitutes a possible way of accessing the data. The lexicon link-up made it possible to realize a more advanced form of lemmatization in which the constituent parts of split verbs and foreign multiword units were considered jointly and a lemma was associated with the combination as a whole. The protocol that was used (Piepenbrock 2004) is available here in .ps and .pdf format (Dutch only).

For more information, see ../version_1.0/annot/lex linkup/info.htm

Broad phonetic transcription

For about one million words a (verified) broad phonetic transcription is available. The protocol that has been used (Gillis 2001) is available here in .ps and .pdf format (Dutch only). The transcriptions were made using the PRAAT program.

For more information about the broad phonetic transcription, see ../version_1.0/annot/phonetics/info.htm

Word segmentation

For all data for which a verified phonetic transcription is available, the speech file has been aligned with the orthographic transcript at the level of the word. The output has been checked manually and corrected. The protocol (Binnenpoorte 2002, 2004) is available here in .ps and .pdf format (Dutch only). For the remainder of the material the speech files and transcripts have been aligned automatically, without manual verification.

For more information about the word segmentation, see ../version_1.0/annot/word align/info.htm

Syntactic annotation

For the syntactic annotation of a selection of one million words an annotation scheme has been developed. Before annotation of the corpus was begun, the scheme was applied to several test samples. This led to some adaptations after which the protocol was finalised. Syntactic annotation was carried out semi-automatically, using the Annotate software. The protocol that has been used (Hoekstra et al. 2003) is available here in .pdf format (Dutch only).

For more information about the syntactic annotation, see ../version_1.0/annot/syntax/info.htm

Prosodic annotation

For approximately 250,000 words a prosodic annotation is available. The annotation encompasses the identification of the most important phrase boundaries as well as the one or two most important words (sentence accents) of each phrase. The protocol that has been used (Martens 2003) is available here in .ps and .pdf format (Dutch only).

For more information about the prosodic annotation, see ../version_1.0/annot/prosody/info.htm

Development of exploitation software

Within the Spoken Dutch Corpus project exploitation software was developed by the technical group at the MPI. The software provides easy and efficient access to the data.

For more information (incl. documentation) about the exploitation software, see ../corex/info.htm

Timetable

The Spoken Dutch Corpus project ran for some five years. The official starting-date was 1 June 1998. During the first year much time was invested in corpus design, the development of various protocols (especially for making the recordings, signal processing, the archiving of the data, the orthographic transcription and the broad phonetic transcription) and the selection and adaptation of tools and resources (such as the lexicon). The corpus was compiled incrementally. The project ended 1 March 2004.

Dissemination of the results

Parts of the corpus were already made available in the course of the project through a number of intermediate releases. The present release (version 1.0) of the complete corpus replaces all previous releases. Version 1.0 includes the results of the Spoken Dutch Corpus project as they were available when the project ended on 1 March 2004. In all, it comprises 33 DVDs, 32 of which contain the sound files that are part of the corpus.

The distribution of the corpus, including the recordings, is handled by the Dutch HLT Centre (TST-Centrale).

Mid-term evaluation

In October 2001 the Spoken Dutch Corpus was subjected to a mid-term evaluation. The results of this evaluation can be found in the evaluation report that is available here in .ps and .pdf format.

Publications

Apart from the various protocols and working documents that were produced during the project, a number of publications have appeared. In these papers various aspects of the design and annotation of the corpus are discussed in some detail. For an overview we refer to the list of publications.

Demo

In order to illustrate the wide variation in speech that occurs in the corpus, some samples have been included here that you can listen to and inspect.

The orthographic transcript of each sample is presented here simply as text without any reference to the speaker. The time markers that occur in the orthographic transcript and that link the transcript with the sound file are indicated by means of two vertical bars (||). To play the audio file of a sample, use its "Play audio" link.

It is also possible to inspect the part-of-speech tagging and lemmatization of the samples. To do so, use the "POS tags and lemmas" link of the sample.
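
Because the time markers appear in the sample transcripts as two vertical bars, a transcript can be cut back into its chunks with a few lines of code. The Python sketch below is a minimal illustration; the function name split_chunks is an assumption, and the excerpt is taken from the third sample. The markers as displayed here carry no time values, so the sketch recovers only the chunk boundaries.

```python
def split_chunks(transcript, marker="||"):
    """Split a transcript on its time markers and drop empty pieces."""
    pieces = (piece.strip() for piece in transcript.split(marker))
    return [piece for piece in pieces if piece]


# Opening of the third sample as printed in this section.
excerpt = (
    "in het VU-ziekenhuis in Amsterdam || "
    "is Danny Blind vanmiddag aan zijn rechter knie onderzocht. ||"
)

for number, chunk in enumerate(split_chunks(excerpt), start=1):
    print(number, chunk)
```

For this excerpt the function returns two chunks; the marker at the end produces no empty third chunk because empty pieces are discarded.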

Available samples:

record announcement on a local radio station
sports commentary (soccer) on a local radio station
sports news on a local radio station

Sample 1

Sample description Record announcement on a local radio station

ID fn000010

Orthographic transcription Always*v Have*v And*v Always*v Will*v de nieuwe Ace*v Of*v Base*v || en hij klinkt nog leuk het is wel een beetje een ABBA-achtig || -typig*n uh liedje weet je wel ja ABBA-achtig -typig*n. || ja nou zo is misschien verkeerde woordkeuze || maar uhm zo ga 'k het wel noemen || achtig*n en typig*n.

Audio format 16 bits, 16 kHz wav format (mono)

File size 479 kB

Duration 15.29 sec

Play audio

POS tags and lemmas

Sample 2

Sample description Sports commentary (soccer) on a local radio station

ID fn000024

Orthographic transcription ja 't was in de eenentwintigste minuut. || toen uh brak op de rechterkant Bas Schaaij door || hij omspeelde z'n man mooi legde de bal terug. || in de zestien meter kwam uh Rikken || binnengelopen die werd aangetikt || tenminste zo oordeelde scheidsrechter uh Tempelaar. || hij gaf daarvoor een strafschop een gele paart*u || gele kaart voor Felibor Peters || en die werd uh de strafschop werd verzilverd || door uh Luc Van Raaij. || en nog geen twee minuten later || was de bal uh in één keer werd ie diep gegeven en || werd er wederom gescoord. || nu was het Mario Lammers met een uh knap afstandsschot. || het is dus nul twee voor Hatert.

Audio format 16 bits, 16 kHz wav format (mono)

File size 937 kB

Duration 29.96 sec

Play audio

POS tags and lemmas

Sample 3

Sample description Sports news on a local radio station

ID fn000040

Orthographic transcription in het VU-ziekenhuis in Amsterdam || is Danny Blind vanmiddag aan zijn rechter knie onderzocht. || er is gebleken dat er geen verder verlies || van het kraakbeen is opgetreden. || wel is er wat irritatie aan de binnenmeniscus geconstateerd. || er zal per wedstrijd dan ook worden gekeken || of Danny Blind vanaf nu inzetbaar is || maar er werd eerst gevreesd voor helemaal nooit meer kunnen spelen.|| nou dat valt dus gelukkig mee. || ben 'k erg blij mee persoonlijk als fan || zullen we maar zeggen. ||

Audio format 16 bits, 16 kHz wav format (mono)

File size 714 kB

Duration 22.84 sec

Play audio

POS tags and lemmas
