Corpora and features for cross-lingual UD parsing - Datasets

Authors: Manon Scholivet, Franck Dary, Alexis Nasr, Benoit Favre, Carlos Ramisch
Version 1.0 - April 4, 2019
Download the data set

Corpora

Our corpora are based on the CoNLL 2017 shared task data. The training and development sets are balanced across languages by number of tokens. The test corpus is simply the concatenation of all test treebanks available for the 40 languages.

The first 10 columns follow the CoNLL-U format (without the text and sent_id metadata comments). The eleventh column contains the code of the original treebank from which the sentence was taken.
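As a rough illustration of this layout, the Python sketch below reads such a file sentence by sentence. It assumes tab-separated columns and blank lines between sentences, as in standard CoNLL-U. The names `Token` and `read_sentences` and the sample rows (including the treebank code `fr`) are invented for the example and are not part of the data set.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    conllu: List[str]  # the 10 standard CoNLL-U columns
    treebank: str      # 11th column: code of the source treebank


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of Token; blank lines separate sentences."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines are not expected in these corpora; skip them anyway.
            continue
        cols = line.split("\t")
        if len(cols) != 11:
            raise ValueError(f"expected 11 columns, got {len(cols)}: {line!r}")
        sentence.append(Token(cols[:10], cols[10]))
    if sentence:
        yield sentence


# Hypothetical two-token sentence (delexicalised, so FORM and LEMMA are "_").
sample = [
    "1\t_\t_\tDET\t_\tDefinite=Def|Number=Sing\t2\tdet\t_\t_\tfr\n",
    "2\t_\t_\tNOUN\t_\tNumber=Sing\t0\troot\t_\t_\tfr\n",
    "\n",
]
sentences = list(read_sentences(sample))
```

With a real file, `read_sentences(open(path, encoding="utf-8"))` would be used in the same way.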

WALS matrices

The 4 matrix files are in CSV format. They were extracted from the World Atlas of Language Structures (WALS). The header of each matrix lists the WALS feature of each column, and each subsequent line contains W(l), the vector representation of the language l according to WALS.

A value of -1 in a vector means that the feature Wi is unspecified for the language l. The 2 files named matrix_W(n|80)_replaceValue contain the same matrices, with each missing value replaced by the value of the nearest neighbour.
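A minimal Python sketch of reading one of these matrices and spotting unspecified features is given below. It assumes a comma-separated file whose first column holds a language identifier and whose header row holds the WALS feature IDs. The exact column layout, the function names and the sample rows are assumptions made for the example, so check them against the actual files.

```python
import csv
import io

MISSING = -1  # marker for an unspecified WALS feature


def load_wals_matrix(handle):
    """Return (feature_ids, {language: [int, ...]}) from a CSV handle."""
    reader = csv.reader(handle)
    header = next(reader)
    features = header[1:]  # assumed: first column is the language identifier
    vectors = {}
    for row in reader:
        if not row:
            continue
        vectors[row[0]] = [int(value) for value in row[1:]]
    return features, vectors


def missing_features(features, vector):
    """List the feature IDs that are unspecified (-1) in a language vector."""
    return [f for f, v in zip(features, vector) if v == MISSING]


# Hypothetical miniature matrix, not taken from the data set.
sample = io.StringIO("lang,81A,82A,83A\nfr,2,1,2\nxx,-1,1,-1\n")
features, vectors = load_wals_matrix(sample)
unspecified = missing_features(features, vectors["xx"])
```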

The feature values are encoded using positive integers, as follows:

81A: Order of Subject, Object and Verb
- 1: SOV
- 2: SVO
- 3: VSO
- 4: VOS
- 5: OVS
- 6: OSV
- 7: No dominant order
82A: Order of Subject and Verb
- 1: SV
- 2: VS
- 3: No dominant order
83A: Order of Object and Verb
- 1: OV
- 2: VO
- 3: No dominant order
85A: Order of Adposition and Noun Phrase
- 1: Postpositions
- 2: Prepositions
- 3: Inpositions
- 4: No dominant order
- 5: No adpositions
86A: Order of Genitive and Noun
- 1: Genitive-Noun
- 2: Noun-Genitive
- 3: No dominant order
87A: Order of Adjective and Noun
- 1: Adjective-Noun
- 2: Noun-Adjective
- 3: No dominant order
- 4: Only internally-headed relative clauses
88A: Order of Demonstrative and Noun
- 1: Demonstrative-Noun
- 2: Noun-Demonstrative
- 3: Demonstrative prefix
- 4: Demonstrative suffix
- 5: Demonstrative before and after Noun
- 6: Mixed
89A: Order of Numeral and Noun
- 1: NumN
- 2: NNum
- 3: Both orders of numeral and noun with neither order dominant
- 4: Numeral only modifies verb
90A: Order of Relative Clause and Noun
- 1: NRel
- 2: RelN
- 3: Internally-headed relative clause
- 4: Correlative relative clause
- 5: Adjoined relative clause
- 6: Double-headed relative clause
- 7: Mixed types of relative clause with none dominant
92A: Position of Polar Question Particles
- 1: Initial
- 2: Final
- 3: Second position
- 4: Other position
- 5: In either of two positions
- 6: No Question particle
95A: Relationship between the Order of Object and Verb and the Order of Adposition and Noun Phrase
- 1: OV & Postpositions
- 2: OV & Prepositions
- 3: VO & Postpositions
- 4: VO & Prepositions
- 5: Other
96A: Relationship between the Order of Object and Verb and the Order of Relative Clause and Noun
- 1: OV & RelN
- 2: OV & NRel
- 3: VO & RelN
- 4: VO & NRel
- 5: Other
97A: Relationship between the Order of Object and Verb and the Order of Adjective and Noun
- 1: OV & AdjN
- 2: OV & NAdj
- 3: VO & AdjN
- 4: VO & NAdj
- 5: Other
101A: Expression of Pronominal Subjects
- 1: Pronominal subjects are expressed by pronouns in subject position that are normally if not obligatorily present
- 2: Pronominal subjects are expressed by affixes on verbs
- 3: Pronominal subjects are expressed by clitics with variable host
- 4: Pronominal subjects are expressed by subject pronouns that occur in a different syntactic position from full noun phrase subjects
- 5: Pronominal subjects are expressed only by pronouns in subject position, but these pronouns are often left out
- 6: More than one of the above types with none dominant
112A: Negative Morphemes
- 1: Negative affix
- 2: Negative particle
- 3: Negative auxiliary verb
- 4: Negative word, unclear if verb or particle
- 5: Variation between negative word and affix
- 6: Double negation
116A: Polar Questions
- 1: Question particle
- 2: Interrogative verb morphology
- 3: Question particle and interrogative verb morphology
- 4: Interrogative word order
- 5: Absence of declarative morphemes
- 6: Interrogative intonation only
- 7: No interrogative-declarative distinction
143A: Order of Negative Morpheme and Verb
- 1: NegV
- 2: VNeg
- 3: [Neg-V]
- 4: [V-Neg]
- 5: Negative Tone
- 6: Type 1 / Type 2
- 7: Type 1 / Type 3
- 8: Type 1 / Type 4
- 9: Type 2 / Type 3
- 10: Type 2 / Type 4
- 11: Type 3 / Type 4
- 12: Type 3 / Negative Infix
- 13: Optional Single Negation
- 14: Obligatory Double Negation
- 15: Optional Double Negation
- 16: Optional Triple Negation with Obligatory Double Negation
- 17: Optional Triple Negation with Optional Double Negation
143E: Preverbal Negative Morphemes
- 1: Preverbal negative word
- 2: Negative prefix
- 3: Both preverbal negative word and negative prefix
- 4: No preverbal negative morpheme
143F: Postverbal Negative Morphemes
- 1: Postverbal negative word
- 2: Negative suffix
- 3: Both postverbal negative word and negative suffix
- 4: No postverbal negative morpheme
143G: Minor morphological means of signaling negation
- 1: Negative tone
- 2: Negative infix
- 3: Negative stem change
- 4: No negative tone, infix or stem change
144A: Position of Negative Word With Respect to Subject, Object, and Verb
- 1: NegSVO
- 2: SNegVO
- 3: SVNegO
- 4: SVONeg
- 5: NegSOV
- 6: SNegOV
- 7: SONegV
- 8: SOVNeg
- 9: NegVSO
- 10: VSNegO
- 11: VSONeg
- 12: NegVOS
- 13: ONegVS
- 14: ONegVS
- 15: OSVNeg
- 16: More than one position for negative morpheme, with none dominant
- 17: Optional single negation
- 18: Obligatory double negation
- 19: Optional double negation
- 20: Morphological negation only (but not double negation)
- 21: Other language
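
To turn the integer codes back into readable labels, a lookup table built from the lists above is enough. The sketch below covers only features 81A and 83A for brevity. The names `WALS_LABELS` and `decode` are illustrative, and returning `None` for -1 is one possible convention for unspecified values.

```python
# Labels copied from the value lists above (only two features shown).
WALS_LABELS = {
    "81A": {1: "SOV", 2: "SVO", 3: "VSO", 4: "VOS", 5: "OVS", 6: "OSV",
            7: "No dominant order"},
    "83A": {1: "OV", 2: "VO", 3: "No dominant order"},
}


def decode(feature, value):
    """Map an integer code to its label; -1 (unspecified) becomes None."""
    if value == -1:
        return None
    return WALS_LABELS[feature][value]


word_order = decode("81A", 2)
unknown = decode("83A", -1)
```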

BASIC features used in the parser

The SIGMA.fm file describes the features used in the BASIC parser configuration.

Examples of the meaning of each feature:

b.-2.POS: the POS of the word in the buffer at position -2 relative to the current buffer head.
s.1.ldep.LABEL: the DEPREL label of the closest left dependent of the word at position 1 in the stack (immediately below the top).
tc.0: the previous transition predicted by the parser's classifier.
s.0.GOV: the position of the word predicted as the head of the word at the top of the stack, relative to the position of that top-of-stack word.
b.0.M1: the value of M1, the first morphological feature, for the word at the buffer's head.
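
The feature names above appear to follow a dotted pattern: a source (`b` for buffer, `s` for stack, `tc` for transition history), an integer offset, an optional path such as `ldep`, and a final attribute. The Python sketch below splits a name along those lines. This grammar is inferred from the five examples only and is not a specification of the SIGMA.fm format; `FeatureSpec` and `parse_feature` are hypothetical names.

```python
from typing import NamedTuple, Optional, Tuple


class FeatureSpec(NamedTuple):
    source: str                # "b" (buffer), "s" (stack) or "tc" (transitions)
    index: int                 # offset relative to the buffer head / stack top
    path: Tuple[str, ...]      # optional navigation steps, e.g. ("ldep",)
    attribute: Optional[str]   # e.g. "POS", "LABEL", "GOV", "M1"; None for "tc.0"


def parse_feature(name: str) -> FeatureSpec:
    """Split a dotted feature name into its assumed components."""
    parts = name.split(".")
    source, index = parts[0], int(parts[1])
    rest = parts[2:]
    attribute = rest[-1] if rest else None
    path = tuple(rest[:-1])
    return FeatureSpec(source, index, path, attribute)


specs = [parse_feature(n) for n in ("b.-2.POS", "s.1.ldep.LABEL", "tc.0")]
```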

Morphological features

The morphological features used for parsing were selected according to their frequency and to the number of languages whose treebanks contain them, as detailed in the paper. The selected UD morphological features are:

M1 : Number
M2 : Case
M3 : VerbForm
M4 : Person
M5 : Mood
M6 : Tense
M7 : PronType
M8 : NumType
M9 : Polarity
M10 : Gender
M11 : Voice
M12 : Degree
M13 : Reflex
M14 : Poss
M15 : Definite
M16 : Aspect
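
One way to obtain the M1 to M16 values from a CoNLL-U FEATS string is sketched below in Python. The slot order comes from the list above. The `_` placeholder for absent features and the function name `morph_columns` are assumptions for the example; how the values are actually stored for the parser may differ.

```python
# UD feature names in slot order M1..M16, as listed above.
MORPH_SLOTS = [
    "Number", "Case", "VerbForm", "Person", "Mood", "Tense", "PronType",
    "NumType", "Polarity", "Gender", "Voice", "Degree", "Reflex", "Poss",
    "Definite", "Aspect",
]


def morph_columns(feats, empty="_"):
    """Map a FEATS string such as 'Gender=Masc|Number=Sing' to M1..M16."""
    values = {}
    if feats != "_":
        for pair in feats.split("|"):
            key, _, value = pair.partition("=")
            values[key] = value
    # Features outside the 16 selected ones (or layered ones such as
    # Number[psor]) are ignored by this sketch.
    return {
        f"M{i}": values.get(name, empty)
        for i, name in enumerate(MORPH_SLOTS, start=1)
    }


cols = morph_columns("Definite=Def|Gender=Masc|Number=Sing")
```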

Using and citing

The datasets are described in detail and used in the experiments of the paper below. Please cite it if you use this material in your research.

Manon Scholivet, Franck Dary, Alexis Nasr, Benoit Favre, Carlos Ramisch, "Typological Features for Multilingual Delexicalised Dependency Parsing", Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, MN, USA, 2019.

@inproceedings{p:scholivet-etAl:2019:naacl,
  title = {Typological Features for Multilingual Delexicalised Dependency Parsing},
  author = {Manon Scholivet and Franck Dary and Alexis Nasr and Benoit Favre and Carlos Ramisch},
  booktitle = {Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019)},
  year = {2019},
  address = {Minneapolis, MN, USA},
}