Compositionality of Nominal Compounds - Datasets

Authors: Silvio Cordeiro, Carlos Ramisch, Aline Villavicencio, Leonardo Zilio, Marco Idiart, Rodrigo Wilkens
Version 2.0 - August 16, 2022

Download
Older version (did not include extra set)

Description

This package contains numerical judgements by native speakers on the compositionality of 190 nominal compounds in English (EN), 180 nominal compounds in French (FR), and 180 nominal compounds in Brazilian Portuguese (PT). The English data is split into two parts. The original 90 English compounds were annotated to complement the 90 compounds in the Reddy dataset (see below). The "extra" 100 English compounds were annotated to perform generalisation experiments in the Computational Linguistics paper (Section 6.3).

Judgements were obtained using Amazon Mechanical Turk (EN and FR) and a web interface for volunteers (PT). Every compound has three scores: compositionality of the head word, compositionality of the modifier word, and compositionality of the whole compound. Scores range from 1 (fully idiomatic) to 5 (fully compositional) and are averaged over several annotators (around 10 to 20, depending on the language). All compounds also have synonyms and similar expressions given by annotators.

The datasets are described in detail and used in the experiments of papers below. Please cite one of them if you use this material in your research.

Our methodology is inspired by Reddy, McCarthy and Manandhar (2011). The analyses and experiments on English in our papers also use their set of 90 compounds and judgments, but their dataset is not included in this package. To obtain an English dataset fully comparable to the one used in our experiments, please download their data as well and cite their paper.

Quick start

If you only want to use our datasets to evaluate your compositionality prediction models, you're probably interested in the scores in the column named compositionality of the following files:

annotations/unfiltered/en.unfiltered.csv
annotations/unfiltered/en-extra.unfiltered.csv
annotations/unfiltered/fr.unfiltered.csv
annotations/unfiltered/pt.unfiltered.csv
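
As a rough illustration of such an evaluation, the sketch below reads the compositionality column and computes Spearman's rank correlation between the gold scores and a model's predictions. It is a sketch under assumptions, not part of the package: the key column name (compound), the comma delimiter, the toy rows and the predicted values are all hypothetical, and rank correlation is one common choice of metric rather than something the files require. Check the header of the actual files before adapting it.

```python
import csv
import io

def read_gold(handle, key_column, score_column="compositionality", delimiter=","):
    """Map each compound to its averaged whole-compound score."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return {row[key_column]: float(row[score_column]) for row in reader}

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy stand-in for one of the CSV files. The "compound" header and all
# values are invented for illustration; real files have more columns.
toy = io.StringIO(
    "compound,compositionality\n"
    "climate change,4.9\n"
    "ivory tower,1.2\n"
    "dead end,2.0\n"
    "police car,4.5\n"
)
gold = read_gold(toy, key_column="compound")

# Hypothetical model predictions (any monotone scale works for rank correlation).
predicted = {"climate change": 0.81, "ivory tower": 0.15, "dead end": 0.40, "police car": 0.77}

# Evaluate only on compounds present in both the gold data and the predictions.
shared = sorted(gold.keys() & predicted.keys())
rho = spearman([gold[c] for c in shared], [predicted[c] for c in shared])
```

To run this on a real file, replace the io.StringIO object with an opened file, for example open("annotations/unfiltered/en.unfiltered.csv", newline=""), and adjust key_column and delimiter to match its header.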

Folders

annotations: results, including raw files with individual annotations, averaged unfiltered and averaged filtered data, as well as quality metrics and distribution graphics (see filtering parameters below)
bin: scripts used to filter and estimate the quality of datasets (MWE workshop paper)
compounds-lists: the list of compounds and auxiliary information (gender, number, example sentences, etc.) used to replace placeholders in the MTurk questionnaire (in CSV format for FR and EN) or to create the dynamic HTML annotation interface (in MySQL database format for PT)
questionnaires: MTurk and HTML interfaces used in data collection. FR interface in HTML is included, but the data in this package comes from MTurk.

Files post-processing

The following commands were executed to create most files in the annotations folder.

# Unfiltered averaged files used in ACL long and short papers

for lg in en en-extra fr pt; do
  ../bin/filter-answers.py --zscore-thresh=10000000 --spearman-thresh=-1 \
    --batch-file raw/$lg.raw.csv --lang ${lg:0:2} \
    > unfiltered/$lg.unfiltered.csv \
    2> unfiltered/$lg.unfiltered.log
done

# Generation of filtered averaged files used in MWE workshop paper

for lg in en en-extra pt; do # fr has different thresholds
  ../bin/filter-answers.py --zscore-thresh=2.2 --spearman-thresh=0.5 \
    --batch-file raw/$lg.raw.csv --lang ${lg:0:2} \
    > filtered/$lg.filtered.csv 2> filtered/$lg.filtered.log
done
../bin/filter-answers.py --zscore-thresh=2.5 --spearman-thresh=0.5 \
  --batch-file raw/fr.raw.csv --lang fr > filtered/fr.filtered.csv \
  2> filtered/fr.filtered.log

# Generation of graphics and evaluation of datasets
mkdir -p graphics
for lg in en en-extra fr pt; do
  for f in unfiltered filtered; do
    ../bin/intrinsic-quality-dataset.py --avg-file $f/$lg.$f.csv \
      2> quality/$lg.$f.quality
    mv $f/*.pdf graphics
  done
done

Note: Data may differ slightly from the papers because we added some new annotations after the papers were written. The 100 compounds in the EN-extra dataset were only used in the Computational Linguistics paper, but we provide filtering analyses here for comparison.
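
To give an intuition for the thresholds in the commands above, here is a minimal sketch of z-score filtering before averaging. It is not the code of filter-answers.py: it assumes that --zscore-thresh removes individual answers whose absolute z-score among the answers for the same compound exceeds the threshold, and it uses the population standard deviation, which is also an assumption. The answers are invented. The --spearman-thresh option presumably works at the annotator level (dropping annotators who correlate poorly with the others) and is not shown here.

```python
from statistics import mean, pstdev

def zscore_filter(scores, threshold):
    """Keep the answers whose |z-score| among this compound's answers is <= threshold."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:  # all annotators agree, so there is nothing to remove
        return list(scores)
    return [s for s in scores if abs(s - mu) / sigma <= threshold]

# Invented answers from ten annotators for a single compound (1-5 scale).
answers = [5, 5, 4, 5, 5, 4, 5, 5, 5, 1]

kept = zscore_filter(answers, threshold=2.2)              # the outlying 1 is dropped
unfiltered = zscore_filter(answers, threshold=10000000)   # nothing is dropped

filtered_average = mean(kept)
unfiltered_average = mean(answers)
```

Under this reading, a threshold as large as 10000000 can never be exceeded and a Spearman threshold of -1 is the lowest possible correlation, which would explain why the first loop produces the "unfiltered" files.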

LexSubNC - Lexical Substitution of Nominal Compounds in Portuguese

Rodrigo Wilkens, Leonardo Zilio, Silvio Cordeiro, Felipe S. F. Paula, Carlos Ramisch, Marco Idiart, Aline Villavicencio
Version 1.0 - September 20, 2017
Download the data set

Description

This package is an extension of the original compositionality datasets and includes more detailed annotation for Portuguese lexical substitution candidates in the original dataset. It contains the same 180 nominal compounds in Portuguese as the compositionality dataset. It additionally contains frequency and PMI from a large Brazilian Portuguese corpus (around 1.2 billion words), as well as lexical substitutes annotated according to the following categories:

Invalid: the substitution candidate is not fit for substitution, either for being too specific for a given context or for simply not being valid for the target MWE.
Syn-SW: the substitution candidate is a single-word matching synonym in relation to the target MWE.
NearSyn-SW: the substitution candidate is a single-word quasi-synonym in relation to the target MWE.
Syn-MWE: the substitution candidate is a multiword matching synonym in relation to the target MWE.
NearSyn-MWE: the substitution candidate is a multiword quasi-synonym in relation to the target MWE.
Paraphrase: the substitution candidate is a paraphrase of the target MWE.
Definition: the substitution candidate is a definition of the target MWE.
Head
Modifier
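
For readers unfamiliar with the PMI (pointwise mutual information) values mentioned in the description, the sketch below shows the textbook definition for a two-word compound. The exact variant used to build this package (logarithm base, smoothing, lemmatised or surface counts) is not stated here, so base 2 and raw counts are assumptions, and the counts are invented. Only the corpus size of about 1.2 billion words comes from the description.

```python
import math

def pmi(pair_count, head_count, modifier_count, corpus_size):
    """log2(P(w1 w2) / (P(w1) * P(w2))), with each probability estimated as count / corpus_size."""
    return math.log2(pair_count * corpus_size / (head_count * modifier_count))

# Invented counts for a hypothetical compound and its two component words.
score = pmi(pair_count=1200, head_count=60000, modifier_count=240000,
            corpus_size=1_200_000_000)
```

With these made-up counts the compound occurs 100 times more often than expected if its two words were independent, so the score is log2(100), about 6.64.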

The lexical substitutes were provided by volunteer native-speaker annotators, who were asked to suggest substitution candidates for the compounds in context. The suggestions from all annotators were pooled and sorted by frequency. A linguist then manually assigned a category to each distinct substitution candidate in this pool.

The folder contains the following files:

LexSubNC-compounds.csv: list of 180 compounds and their information (lemmas, frequency, PMI, compositionality). The last columns contain the histogram of equivalents in each category
LexSubNC-equivalents.csv: list of equivalents (lexical substitutes) and their corresponding annotations, according to the categories above. The number of times each substitute was suggested is also provided (equivalent_freq)
LexSubNC.ods: same as the previous two files in LibreOffice format, one sheet per file
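
As an illustration of how the two CSV files relate, the sketch below rebuilds a per-compound histogram of categories from rows shaped like those of LexSubNC-equivalents.csv. The header names compound, equivalent and category are assumptions (only equivalent_freq is named above), the rows are invented, and it is not stated here whether the histogram in LexSubNC-compounds.csv counts distinct equivalents or weights them by frequency, so the sketch supports both.

```python
import csv
import io
from collections import Counter, defaultdict

# Invented rows. "compound", "equivalent" and "category" are assumed header
# names; the categories assigned here are for illustration only.
toy = io.StringIO(
    "compound,equivalent,category,equivalent_freq\n"
    "olho gordo,inveja,Syn-SW,7\n"
    "olho gordo,cobiça,NearSyn-SW,3\n"
    "olho gordo,mau olhado,NearSyn-MWE,2\n"
    "olho gordo,ciúme,NearSyn-SW,1\n"
)

def category_histogram(handle, weighted=False):
    """Count equivalents per category for each compound.

    With weighted=True, each equivalent counts equivalent_freq times
    instead of once.
    """
    hist = defaultdict(Counter)
    for row in csv.DictReader(handle):
        weight = int(row["equivalent_freq"]) if weighted else 1
        hist[row["compound"]][row["category"]] += weight
    return hist

hist = category_histogram(toy)
```

For the real file, pass an opened file handle instead of the io.StringIO object and adapt the column names to its header.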

The details about this data can be found in our IWCS 2017 paper:

LexSubNC: A Dataset of Lexical Substitution for Nominal Compounds

Note: Data may differ slightly from the papers because we added some new annotations after the papers were written.
