Compositionality of Nominal Compounds - Datasets

Description

This package contains numerical judgements by native speakers on the compositionality of 190 nominal compounds in English (EN), 180 nominal compounds in French (FR), and 180 nominal compounds in Brazilian Portuguese (PT). The English data is split into two parts. The original 90 English compounds were annotated to complement the 90 compounds in the Reddy dataset (see below). The "extra" 100 English compounds were annotated to perform generalisation experiments in the Computational Linguistics paper (Section 6.3).

Judgements were obtained using Amazon Mechanical Turk (EN and FR) and a web interface for volunteers (PT). Every compound has 3 scores: compositionality of the head word, compositionality of the modifier word, and compositionality of the whole compound. Scores range from 1 (fully idiomatic) to 5 (fully compositional) and are averaged over several annotators (around 10 to 20, depending on the language). All compounds also have synonyms and similar expressions given by annotators.

The datasets are described in detail and used in the experiments of the papers below. Please cite one of them if you use this material in your research.

Our methodology is inspired by Reddy, McCarthy and Manandhar (2011). The analyses and experiments on English in our papers also use their set of 90 compounds and judgments. Their dataset is not included in this package, however. To obtain an English dataset fully comparable to the one used in our experiments, please download their data as well and cite their paper.

Quick start

If you only want to use our datasets to evaluate your compositionality prediction models, you're probably interested in the scores present in the column named compositionality of files:
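As a minimal sketch of reading that column, the shell function below looks up a column by its header name and prints it next to the first column. It assumes the averaged files are comma-separated, start with a header row, and identify the compound in their first column. None of this is stated here, so check the actual files before relying on it. The function name `extract_column` is hypothetical.

```shell
# Sketch only: assumes a comma-separated file whose first row is a header
# and whose first column identifies the compound.
# $1 = file to read, $2 = header name of the column to extract.
extract_column() {
  awk -F',' -v name="$2" '
    NR == 1 {
      # Locate the requested column by its header text.
      for (i = 1; i <= NF; i++) if ($i == name) col = i
      if (!col) { print "column not found: " name > "/dev/stderr"; exit 1 }
      next
    }
    { print $1 "\t" $col }   # first column, then the requested score
  ' "$1"
}

# Hypothetical usage; the path is a placeholder, not a file listed here:
# extract_column path/to/averaged.csv compositionality
```

Looking the column up by name rather than by position keeps the sketch independent of how many other columns a file contains. It does not handle quoted fields that themselves contain commas.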

Folders

Files post-processing

The following commands were executed in order to create most files in the annotations folder.

# _Unfiltered averaged files used in ACL long and short paper_

for lg in en en-extra fr pt; do
  ../bin/filter-answers.py --zscore-thresh=10000000 --spearman-thresh=-1 \
    --batch-file raw/$lg.raw.csv --lang ${lg:0:2} \
    > unfiltered/$lg.unfiltered.csv \
    2> unfiltered/$lg.unfiltered.log
done

# _Generation of filtered averaged files used in MWE workshop paper_

for lg in en en-extra pt; do # fr has different thresholds
  ../bin/filter-answers.py --zscore-thresh=2.2 --spearman-thresh=0.5 \
    --batch-file raw/$lg.raw.csv --lang ${lg:0:2} \
    > filtered/$lg.filtered.csv 2> filtered/$lg.filtered.log
done    
../bin/filter-answers.py --zscore-thresh=2.5 --spearman-thresh=0.5 \
  --batch-file raw/fr.raw.csv --lang fr > filtered/fr.filtered.csv \
  2> filtered/fr.filtered.log

# _Generation of graphics and evaluation of datasets_
mkdir -p graphics
for lg in en en-extra fr pt; do
  for f in unfiltered filtered; do
    ../bin/intrinsic-quality-dataset.py --avg-file $f/$lg.$f.csv \
      2> quality/$lg.$f.quality
    mv $f/*.pdf graphics
  done
done

Note: Data may differ slightly from the papers because we have added some new annotations since the papers were written. The 100 compounds in the EN-extra dataset were only used in the Computational Linguistics paper, but we provide filtering analyses here for comparison.


LexSubNC - Lexical Substitution of Nominal Compounds in Portuguese

Description

This package is an extension of the original compositionality datasets and includes more detailed annotation for Portuguese lexical substitution candidates in the original dataset. It contains the same 180 nominal compounds in Portuguese as the compositionality dataset. It additionally contains frequency and PMI from a large Brazilian Portuguese corpus (around 1.2 billion words), as well as lexical substitutes annotated according to the following categories:

The lexical substitutes were provided by volunteer native-speaker annotators, who were asked to suggest substitution candidates for the compounds in context. The suggestions from all annotators were pooled together and sorted according to their frequency. This pool was then manually categorized by a linguist, who assigned a category to each distinct substitution candidate.
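The pooling and frequency-sorting step described above can be sketched with standard shell tools. This is an illustration only, not the procedure actually used to build the dataset. It assumes one plain-text file per annotator with one suggestion per line, and the function name `pool_suggestions` is hypothetical.

```shell
# Sketch only: pool suggestions from several annotator files (one
# suggestion per line) and rank them by how often they occur.
# Output: "<count><TAB><suggestion>", most frequent first.
pool_suggestions() {
  cat "$@" |
    sed '/^[[:space:]]*$/d' |   # drop empty lines
    LC_ALL=C sort |             # group identical suggestions together
    LC_ALL=C uniq -c |          # count each distinct suggestion
    LC_ALL=C sort -k1,1nr |     # highest count first
    awk '{ count = $1; sub(/^[[:space:]]*[0-9]+[[:space:]]+/, ""); print count "\t" $0 }'
}

# Hypothetical usage; the file names are placeholders:
# pool_suggestions annotator1.txt annotator2.txt
```

The sketch treats suggestions as identical only when they match exactly. Any normalisation, such as case folding or trimming whitespace, would have to happen before pooling.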

The folder contains the following files:

The details about this data can be found in our IWCS 2017 paper:

Note: Data may differ slightly from the papers because we have added some new annotations since the papers were written.