Corpora and features for cross-lingual UD parsing - Datasets

Corpora

Our corpora are based on the CoNLL 2017 shared task data. The training and development sets are balanced by the number of tokens per language. The test corpus is simply the concatenation of all test treebanks available for the 40 languages.

The first 10 columns follow the CoNLL-U format (without the text and sent_id metadata comments). The eleventh column contains the code of the original treebank from which the sentence was taken.
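
The 11-column layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the released material: it assumes tab-separated fields and blank lines between sentences, as in standard CoNLL-U, and the function names read_sentences and tokens_per_treebank are hypothetical.

```python
from collections import Counter


def read_sentences(lines):
    """Yield each sentence as a list of 11-field rows.

    Assumes tab-separated fields and blank lines between sentences,
    as in standard CoNLL-U. Any remaining comment lines are skipped.
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 11:
            raise ValueError(f"expected 11 columns, got {len(fields)}")
        sentence.append(fields)
    if sentence:
        yield sentence


def tokens_per_treebank(sentences):
    """Count rows per source treebank, using the eleventh column."""
    counts = Counter()
    for sentence in sentences:
        for fields in sentence:
            counts[fields[10]] += 1
    return counts
```

Passing an open file handle to read_sentences and the result to tokens_per_treebank gives a quick check of how many rows each original treebank contributes. Note that the count includes every row, so multiword token ranges, if present, are counted as well.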

WALS matrices

The 4 matrix files are in CSV format. They were extracted from the World Atlas of Language Structures (WALS). The header of each matrix gives the WALS feature for each column, and each subsequent line contains W(l), the vector representation of the language l according to WALS.

A value of -1 in a vector means that the feature Wi is unspecified for the language l. The 2 files named matrix_W(n|80)_replaceValue contain the same matrices, with each missing value replaced by the value of the nearest neighbour.
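
To illustrate the idea of nearest-neighbour replacement, here is a minimal Python sketch. It is not the procedure used to build the released files: the distance measure (share of disagreeing features among those specified in both languages) is an assumption, as is the CSV layout with a language code in the first column. The names load_matrix, distance and fill_missing are hypothetical.

```python
import csv
import io

MISSING = -1  # marker for an unspecified feature, as in the matrices


def load_matrix(handle):
    """Read a header row of feature names, then one row per language.

    Assumption: the first column holds a language code.
    """
    reader = csv.reader(handle)
    header = next(reader)
    features = header[1:]
    vectors = {row[0]: [int(v) for v in row[1:]] for row in reader if row}
    return features, vectors


def distance(u, v):
    """Share of disagreeing features among those specified in both vectors."""
    shared = [(a, b) for a, b in zip(u, v) if a != MISSING and b != MISSING]
    if not shared:
        return float("inf")
    return sum(a != b for a, b in shared) / len(shared)


def fill_missing(vectors):
    """Replace each -1 with the value from the closest language that has one."""
    filled = {}
    for lang, vec in vectors.items():
        # Other languages, ordered from most to least similar.
        others = sorted(
            (o for o in vectors if o != lang),
            key=lambda o: distance(vec, vectors[o]),
        )
        new_vec = list(vec)
        for i, value in enumerate(vec):
            if value == MISSING:
                for other in others:
                    if vectors[other][i] != MISSING:
                        new_vec[i] = vectors[other][i]
                        break
        filled[lang] = new_vec
    return filled
```

A feature that is unspecified for every language stays at -1 in this sketch, since no neighbour can supply a value.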

The feature values are encoded using positive integers, as follows:

BASIC features used in the parser

The SIGMA.fm file describes the features used in the BASIC parser configuration.

Examples of the meaning of each feature:

Morphological features

The morphological features used for parsing were selected according to their frequency and to the number of languages in which they occur in the treebanks, as detailed in the paper. The selected UD morphological features are:

Using and citing

The datasets are described in detail and used in the experiments of the paper below. Please cite it if you use this material in your research.