Meta-data: sample information

The file recordings.xls on the annotation DVD contains information about the samples that make up the corpus.

The successive columns in recordings.xls contain the following information:

recordingID
id-number of the sample: fnNNNNNN or fvNNNNNN, eg. fn000110 / fv4000028. For all samples that originate from the Netherlands the id starts with the letters fn; for all samples from Flanders the id starts with the letters fv.
aXtype
specifies the type of header info: TEXT
creator
specifies who is responsible for this header: CLS-KUN (for the Dutch data) or ELIS-UG (for the Flemish data)
version
the current version of the header info: HEADER.version1.0
aXupdate
date the header information was last updated
info
information about the type of recording; eg spontaneous conversation ('face-to-face'), television programme: Studio Sport, ceremonious speech: opening of the academic year
respType
type of transcription/annotation: SAMPLING
respName
group responsible for the transcription or annotation in preceding column: SPEX, CNTS-UA, ELIS-UG, or ESAT-KUL
respType
type of transcription/annotation: ORTHOGRAPHIC TRANSCRIPTION
respName
group responsible for the transcription or annotation in preceding column: SPEX, CNTS-UA, ELIS-UG, or ESAT-KUL
respType
type of transcription/annotation: PART-OF-SPEECH TAGGING
respName
group responsible for transcription or annotation in preceding column: CLS-KUN or CCL-KUL
respType
type of transcription/annotation: LEMMATISATION
respName
group responsible for transcription or annotation in preceding column: CLS-KUN or CCL-KUL
respType
type of transcription/annotation: LEXICON LINK-UP
respName
group responsible for transcription or annotation in preceding column: CLS-KUN or CCL-KUL
respType
type of transcription/annotation: WORD SEGMENTATION
respName
group responsible for transcription or annotation in preceding column: CLS-KUN, ELIS-UG, or ESAT-KUL
respType
type of transcription/annotation: PHONETIC TRANSCRIPTION
respName
group responsible for transcription or annotation in preceding column: SPEX or CNTS-UA
respType
type of transcription/annotation: SYNTACTIC ANNOTATION
respName
group responsible for transcription or annotation in preceding column: OTS or CCL-KUL
respType
type of transcription/annotation: PROSODIC ANNOTATION
respName
group responsible for transcription or annotation in preceding column: UvT/RUL or CNTS-UA/ELIS-UG
wordCount
number of words in the sample: number
secCount
duration of the sample expressed in the total number of seconds: number
byteCount
indication of the size of the .wav file (expressed in terms of a number of units): number
unit
MB
extNote
notes regarding the sample: none
wph
average number of words per hour: number
distributor
organisation responsible for the distribution: ELDA
WAV-DVD
label of the dvd on which the sound file can be found; eg CGN_WAV_01
author
author of the book the text of which was read aloud: first name or initial(s), surname
biblStringXtitle
title of the book: title
pubName
publisher’s name: name
pubPlace
place of publication: place
pubDate
year of publication: year
rexXdate
recording date: date or year
time
time of recording (optional)
source
indication of source from which recording originated: eg national television, Draadomroep, library for the blind, etc.
producer
producer of the recording: CGN, VNC, Corpus van der Wijst, ANP Radio, etc.
target
gives information about four aspects: text type, degree of preparedness, mode, and domain
text type: specifies the component to which a sample belongs; 15 text types are distinguished: tta-tto (see list below)
degree of preparedness: prep1 = scripted; prep2 = unscripted; prep3 = more-or-less scripted
mode: mod1 = broadcast, radio; mod2 = broadcast, tv; mod3 = non-broadcast
domain: dom1 = private; dom2 = public
term
one or more keywords that characterize the subject matter in the sample
speakerIDs
the identification code(s) of the speaker(s) that occur(s) in the sample: N..... / V....., eg. N00023 / V00023
role
role(s) of the speaker(s) in the specific sample: eg interviewer, interviewee, chairman, contact, interlocutor, lecturer, news-reader, reporter, teacher, pupil, etc. Note: For the Dutch material in components (text types) tta, ttc and ttd the role information and the relation information were mixed up. For these components the relation information is erroneously listed as role information.
age
age class to which the speaker belonged at the time the sample was recorded: age0 = under 18 years of age; age1 = 18-24 years of age; age2 = 25-34 years of age; age3 = 35-44 years of age; age4 = 45-55 years of age; age5 = over 55 years of age; ageX = age unknown
interactionXtype
degree of interaction; it1 = no interaction; it2 = some interaction; it3 = full interaction; it4 = not applicable
interactionXactive
number of active speakers: number
interactionXpassive
passive speakers; yes, no, unknown, not used
relationXactive
relationship(s) between the speakers in a sample. Two main categories are distinguished: family relationships and social relationships. Family relationships that are distinguished are FAM: couple, FAM: parent, FAM: siblings, FAM: in-laws, FAM: other. Social relationships that are distinguished are SOC: friends, SOC: acquaintances, SOC: neighbours, and SOC: colleagues.
relationXpassive
relationship(s) of passive speakers present
aXdesc
description of the role(s) of the passive speaker(s); not used
mutual
relationship(s) between active and passive speakers; not used
locName
place where the recording was made, given as a (reduced) postal code or as a description of the place; possibly unknown or unspecified.
locale
description of the type of space in which the recording was made: loc1 = room of average size; loc2 = open air; loc3 = public place; loc4 = large room; unspecified
activity
short description of activity speaker(s) was (were) involved in at the time of recording
recMediumXtype
recording medium: Mini Disk, DAT Tape, CAS tape, CD ROM, computer, video, audio CD, unspecified
microphoneXtype
type of microphone used: eg ECM-MS907
micDistanceXperson
distance to microphone expressed in terms of number of cm: number or number-number (eg 100, 50-100)
dist
distance between speakers
measure
cm
noise
indication of background noise
recording
nature of recording: analog (DIG1), digital (DIG2) or unspecified (unspecified)
processing
processing of the recording: DIG1, DIG2, unspecified
status
final status of recording: DIG2
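
Some of the coded values described above can be interpreted mechanically. The following is a minimal Python sketch of how that might look, not part of the corpus tooling. All function names (region_of, age_class, words_per_hour, parse_mic_distance) are hypothetical. The sketch assumes that wph can be derived from wordCount and secCount, which the description above does not state explicitly, and that the age class boundaries are inclusive as listed.

```python
import re


def region_of(recording_id):
    """Return the region implied by the two-letter prefix of a recordingID."""
    if recording_id.startswith("fn"):
        return "Netherlands"
    if recording_id.startswith("fv"):
        return "Flanders"
    raise ValueError(f"unexpected recordingID prefix: {recording_id!r}")


def age_class(age):
    """Map an age in years (None if unknown) to the codes age0..age5 / ageX."""
    if age is None:
        return "ageX"
    if age < 18:
        return "age0"
    if age <= 24:
        return "age1"
    if age <= 34:
        return "age2"
    if age <= 44:
        return "age3"
    if age <= 55:
        return "age4"
    return "age5"


def words_per_hour(word_count, sec_count):
    """Average words per hour. Assumption: computed as wordCount / secCount * 3600."""
    return word_count * 3600 / sec_count


def parse_mic_distance(value):
    """Parse 'number' or 'number-number' (in cm) into a (min, max) tuple."""
    match = re.fullmatch(r"(\d+)(?:-(\d+))?", value.strip())
    if match is None:
        raise ValueError(f"not a distance in cm: {value!r}")
    low = int(match.group(1))
    high = int(match.group(2)) if match.group(2) else low
    return (low, high)
```

For example, region_of("fn000110") yields "Netherlands", and parse_mic_distance("50-100") yields the tuple (50, 100).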

The last columns of the table record when a change was made, to which type of annotation, and by whom. For each annotation layer three columns are reserved for this information:

revDate
date the change was made
revType
type of annotation that has been revised: sampling, orthographic transcription, POS tagging, lemmatisation, lexicon link-up, word segmentation, phonetic transcription, syntactic annotation, prosodic annotation
revName
name of group/institute responsible

Note: In future one may want to add a fourth column, recording exactly what change was made:
revChange
description of the change that was made: description

Text types:

tta spontaneous conversations (face-to-face)
ttb interviews with teachers of Dutch
ttc spontaneous telephone dialogues (recorded via a switchboard)
ttd spontaneous telephone dialogues (recorded on MD with local interface)
tte simulated business negotiations
ttf interviews/discussions/debates (broadcast)
ttg (political) discussions/debates/meetings (non-broadcast)
tth lessons recorded in a classroom
tti live (eg sport) commentaries (broadcast)
ttj news reports/reportages (broadcast)
ttk news (broadcast)
ttl commentaries/columns/reviews (broadcast)
ttm ceremonious speeches/sermons
ttn lectures/seminars
tto read speech
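
The codes used in the target column (text type, degree of preparedness, mode, domain) lend themselves to simple lookup tables. The sketch below is one way to decode them in Python. The function name describe_target is hypothetical, and because this document does not specify how the four codes are laid out inside the target cell, the function assumes they have already been split into separate strings.

```python
TEXT_TYPES = {
    "tta": "spontaneous conversations (face-to-face)",
    "ttb": "interviews with teachers of Dutch",
    "ttc": "spontaneous telephone dialogues (recorded via a switchboard)",
    "ttd": "spontaneous telephone dialogues (recorded on MD with local interface)",
    "tte": "simulated business negotiations",
    "ttf": "interviews/discussions/debates (broadcast)",
    "ttg": "(political) discussions/debates/meetings (non-broadcast)",
    "tth": "lessons recorded in a classroom",
    "tti": "live (eg sport) commentaries (broadcast)",
    "ttj": "news reports/reportages (broadcast)",
    "ttk": "news (broadcast)",
    "ttl": "commentaries/columns/reviews (broadcast)",
    "ttm": "ceremonious speeches/sermons",
    "ttn": "lectures/seminars",
    "tto": "read speech",
}
PREPAREDNESS = {
    "prep1": "scripted",
    "prep2": "unscripted",
    "prep3": "more-or-less scripted",
}
MODE = {
    "mod1": "broadcast, radio",
    "mod2": "broadcast, tv",
    "mod3": "non-broadcast",
}
DOMAIN = {"dom1": "private", "dom2": "public"}

# Aspect name -> lookup table, in the order the aspects are listed above.
ASPECTS = {
    "text type": TEXT_TYPES,
    "degree of preparedness": PREPAREDNESS,
    "mode": MODE,
    "domain": DOMAIN,
}


def describe_target(codes):
    """Decode already-separated target codes into {aspect: description}.

    Assumption: the caller has split the target cell into individual codes.
    """
    result = {}
    for code in codes:
        for aspect, table in ASPECTS.items():
            if code in table:
                result[aspect] = table[code]
                break
        else:
            raise ValueError(f"unknown target code: {code!r}")
    return result
```

Calling describe_target(["tto", "prep1", "mod3", "dom2"]) would describe a sample as read speech that is scripted, non-broadcast, and public.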