The .syn format

Files of type .syn contain the syntactically annotated data and can be used with the @nnotate software, which uses the NeGra annotation format. The format is described below. For more detailed information about @nnotate and the NeGra format we refer to the @nnotate website. The .syn format also has an XML variant, which can be found in /data/annot/xml/tig of the annotation DVD and which is described in the format description of .tig.

%% sample fn123456
%%
#FORMAT 3
...
%% word	tag	morph	edge	parent	secedge	comment
#BOS 8 ...
welke	VNW11	U521b	DET	500
films	N2	T107	HD	500
hebben	WW2	T302	HD	501
zij	VNW1	U501u	SU	501
?	LET	T007	--	0
#500	NP	--	WHD	502	OBJ1	501
#501	SV1	--	BODY	502
#502	WHQ	--	--	0
#EOS
...
%% 432 sentences (2530 tokens, 926 phrases)

Each .syn file contains a three-line header. Comment lines are introduced by two percent signs ('%%'). The first line gives the sample number ('%% sample fn123456') and is followed by an empty comment line. The third line gives the NeGra format version ('#FORMAT 3'). Then the first sentence follows. Each sentence is preceded by a comment line that repeats the field names ('%% word tag morph...'), followed by a BEGIN_OF_SENTENCE line ('#BOS 8...'). The number immediately following #BOS is the rank number of the sentence. Each sentence is concluded by an END_OF_SENTENCE line ('#EOS').
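
The file layout described above can be sketched in Python as follows. This is a minimal sketch, not part of @nnotate or the corpus tools: the function name split_sentences is hypothetical, and it assumes that every sentence sits between a '#BOS' line and an '#EOS' line and that anything outside such a pair (comments, '#FORMAT 3') can be skipped.

```python
def split_sentences(lines):
    """Group the lines of a .syn file into (rank, node_lines) pairs.

    Assumption: a sentence is everything between a '#BOS' line and the
    next '#EOS' line; comment lines ('%%') are ignored everywhere.
    """
    sentences = []
    current = None  # (rank, list of node lines) while inside a sentence
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("%%"):
            continue  # comment line, e.g. the repeated field names
        if line.startswith("#BOS"):
            # The number right after '#BOS' is the rank number of the sentence.
            rank = int(line.split()[1])
            current = (rank, [])
        elif line.startswith("#EOS"):
            if current is not None:
                sentences.append(current)
            current = None
        elif current is not None and line.strip():
            current[1].append(line)  # terminal or non-terminal node line
    return sentences
```

Header lines such as '#FORMAT 3' fall outside any sentence and are therefore dropped by this sketch; a fuller reader would probably want to keep them.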

For a terminal node, the first field (the 'word' field) contains the word form; for a non-terminal node it contains a node number. The second field (the 'tag' field) contains the POS (part-of-speech) tag for a terminal node and the node label (syntactic category) for a non-terminal node. Note that these POS tags are not the official POS tags assigned in the POS tagging, but a derived set. The number of official POS tags is so large that the parser would need far too much training data, so a simpler set is used here. The official POS tag can be found in the third field (the 'morph' field), which is not used by the parser. The fourth field (the 'edge' field) contains the edge label, i.e. the name of the syntactic function of the current node within the constituent that immediately dominates it. The number of the mother node can be found in the fifth field (the 'parent' field). This number refers to the first field of another line, on which the mother node is described. In the sample, nodes that are not attached to any constituent (the punctuation mark and the top node 502) have parent number 0. Some constituents, e.g. relative NPs, have a double syntactic function: one in the constituent in which they actually occur, and one in the constituent from which they have been displaced. The name of the syntactic function of such a constituent in its original position is given in the sixth field (the 'secedge' field), and the seventh field (listed as 'comment' in the field-name line) gives the number of the node in which the constituent has that function.
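
The field layout can be illustrated with a small Python sketch that turns one node line into a record. The names Node and parse_node are hypothetical. The sketch assumes that fields are separated by tabs, that a first field of the form '#' plus digits marks a non-terminal node, and, following the sample, that the seventh column holds a node number whenever it is numeric.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    ident: str                # word form, or node number as text for non-terminals
    is_terminal: bool
    tag: str                  # simplified POS tag or syntactic category
    morph: str                # official POS tag ('--' for non-terminals in the sample)
    edge: str                 # syntactic function within the mother node
    parent: int               # number of the mother node
    secedge: Optional[str]    # function in the original position, if present
    secparent: Optional[int]  # node in which that function holds, if present


def parse_node(line: str) -> Node:
    """Parse one terminal or non-terminal line of a sentence."""
    fields = line.rstrip("\n").split("\t")  # assumption: tab-separated fields
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}")
    word, tag, morph, edge, parent = fields[:5]
    # Assumption: '#<digits>' in the first field marks a non-terminal node.
    is_nonterminal = word.startswith("#") and word[1:].isdigit()
    ident = word[1:] if is_nonterminal else word
    secedge = fields[5] if len(fields) > 5 and fields[5] else None
    secparent = int(fields[6]) if len(fields) > 6 and fields[6].isdigit() else None
    return Node(ident, not is_nonterminal, tag, morph, edge,
                int(parent), secedge, secparent)
```

Applied to the line for node 500 in the sample, this gives a non-terminal with parent 502, secondary edge 'OBJ1' and secondary parent 501. Terminal lines have only five fields, so both secondary values stay None.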

In the example above, the first line describes the left-most terminal node of the interrogative sentence 'welke films hebben zij?'. The first field contains the word form 'welke', the second the (simplified) POS tag of 'welke': 'VNW11', the third the official POS tag (currently not used): 'U521b', the fourth the syntactic function of 'welke' within the NP 'welke films': 'DET', and the fifth the number of the NP 'welke films': '500'. Node 500 in turn is described in the sixth line. The first field records the number: '500', the second the syntactic category: 'NP', the third field is empty ('--', since it concerns a non-terminal node), the fourth gives the syntactic function of the NP in node 502 (i.e. the entire sentence): 'WHD' (i.e. complementiser/head of an interrogative sentence), the fifth gives the number of the node in which the NP serves this function: '502', the sixth contains the syntactic function that the NP has in node 501 (i.e. the subclause from which it was displaced): 'OBJ1', and the seventh gives the number of the node in which the NP has this function: '501'.
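
The relations in this example can be made explicit by grouping the nodes under their mother nodes. The sketch below is illustrative only: the tuple layout of ROWS and the function name children_by_parent are assumptions made for this example, with the values copied from the sample sentence.

```python
from collections import defaultdict

# The sample sentence as (word or node number, edge, parent,
# secondary edge, secondary parent); None marks an absent field.
ROWS = [
    ("welke", "DET", 500, None, None),
    ("films", "HD", 500, None, None),
    ("hebben", "HD", 501, None, None),
    ("zij", "SU", 501, None, None),
    ("?", "--", 0, None, None),
    (500, "WHD", 502, "OBJ1", 501),
    (501, "BODY", 502, None, None),
    (502, "--", 0, None, None),
]


def children_by_parent(rows):
    """Return two dicts mapping a mother node number to (edge, child) pairs:
    one for the primary edges, one for the secondary edges."""
    primary = defaultdict(list)
    secondary = defaultdict(list)
    for ident, edge, parent, secedge, secparent in rows:
        primary[parent].append((edge, ident))
        if secedge is not None:
            secondary[secparent].append((secedge, ident))
    return dict(primary), dict(secondary)


primary, secondary = children_by_parent(ROWS)
```

Here primary[502] lists node 500 as 'WHD' and node 501 as 'BODY', while secondary[501] records that node 500 also serves as 'OBJ1' inside node 501.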

The last line of the .syn file contains statistical information about the sample ('%% 432 sentences...'), i.e. the number of sentences, the number of tokens (words), and the number of phrases in the sample.
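
A reader could use this line to check its own counts. The following sketch extracts the three numbers with a regular expression; the pattern is inferred from the single sample line shown above, so the exact wording of the line (and the hypothetical name parse_stats) should be treated as an assumption.

```python
import re

# Pattern inferred from the sample line
# '%% 432 sentences (2530 tokens, 926 phrases)'.
STATS_RE = re.compile(
    r"^%%\s*(\d+)\s+sentences?\s*\((\d+)\s+tokens?,\s*(\d+)\s+phrases?\)"
)


def parse_stats(line):
    """Return the counts from the final comment line, or None if the
    line does not look like a statistics line."""
    match = STATS_RE.match(line.strip())
    if match is None:
        return None
    sentences, tokens, phrases = (int(value) for value in match.groups())
    return {"sentences": sentences, "tokens": tokens, "phrases": phrases}
```

Other comment lines, such as the sample number in the header, do not match the pattern and yield None.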

The full set of labels, including the full set of POS tags, can be found in the text file /data/annot/text/syn/negraheader.txt on the annotation DVD.