Cindy Aloui and Alexis Nasr and Lucie Barque and Carlos Ramisch
This package contains the datasets used in the experiments of the submitted paper "SLICE: Supersense-based Lightweight Interpretable Contextual Embeddings". SLICE is a hybrid model that combines supersense labels with contextual embeddings. We introduce a weakly supervised method to learn interpretable embeddings from raw corpora and small lists of seed words. Our model is able to represent both a word and its context as embeddings in the same compact space, whose dimensions correspond to interpretable supersenses.
The data and code can be downloaded here: slice-data-scripts-20201101.zip
This package contains the data and scripts described below.
If you use SLICE, please cite the following paper:
@InProceedings{aloui-etAl-2020:coling,
author = "Cindy Aloui and Alexis Nasr and Lucie Barque and Carlos Ramisch",
title = "SLICE: Supersense-based Lightweight Interpretable Contextual Embeddings",
booktitle = "28th International Conference on Computational Linguistics (COLING 2020)",
year = "2020",
publisher = "ICCL",
}
seeds: the lists of prototypical nouns used to pseudo-annotate the corpora. Seed nouns are given in plain-text UTF-8 files, with one noun per line. For each supersense, we provide positive and negative seeds in *-all.txt, and the corresponding dev/train splits.
frsemcor: contains the FrSemCor corpus divided into training, dev and test sets according to the Universal Dependencies split of the Sequoia treebank. The file is in Extended CoNLL-U format, with the last column containing the supersenses: FrSemCor
File lexsignatures/10000-nouns.lexsig.txt contains the list of the 10K most frequent noun lemmas of the frWaC and their corresponding lexical signatures. Each lemma is provided on one line, followed by its signature vector. The TAB-separated supersense scores are sorted as follows: ANI, NAT, MAN, INF, DYN, STA. For instance, the word grue (crane) is represented as:
grue 0.0836 0.6519 0.9458 0.0654 0.0619 0.1135
with scores close to 1 for NAT (0.65, the bird sense) and MAN (0.95, the machine sense), and close to 0 for the other supersenses.
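As an illustration of the lexical signature format described above, the following minimal sketch parses one line into a lemma and a mapping from supersense to score. The function names `parse_lexsig_line` and `dominant_supersenses` are hypothetical and are not part of the SLICE scripts. The 0.5 threshold is an arbitrary choice made only for this example, and the sketch assumes that the six scores are the last six whitespace-separated fields of the line.

```python
# Supersense order as documented for the lexical signature file.
SUPERSENSES = ["ANI", "NAT", "MAN", "INF", "DYN", "STA"]


def parse_lexsig_line(line):
    """Return (lemma, {supersense: score}) for one signature line.

    Assumption: the six scores are the last six whitespace-separated
    fields; everything before them is taken as the lemma.
    """
    fields = line.strip().rsplit(None, len(SUPERSENSES))
    if len(fields) != len(SUPERSENSES) + 1:
        raise ValueError("expected a lemma followed by 6 scores: %r" % line)
    lemma, values = fields[0], fields[1:]
    return lemma, dict(zip(SUPERSENSES, (float(v) for v in values)))


def dominant_supersenses(scores, threshold=0.5):
    """List the supersenses whose score reaches the threshold.

    The threshold is a hypothetical value used for illustration only.
    """
    return [name for name, value in scores.items() if value >= threshold]


# The example line for "grue" given in this README.
lemma, scores = parse_lexsig_line(
    "grue\t0.0836\t0.6519\t0.9458\t0.0654\t0.0619\t0.1135"
)
print(lemma, dominant_supersenses(scores))  # prints: grue ['NAT', 'MAN']
```

To process the whole file, the same function could be applied to each line of lexsignatures/10000-nouns.lexsig.txt opened with encoding="utf-8".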
contextsignatures: contains the list of nouns in the FrSemCor train, dev and test parts, with the corresponding context and lexical signatures. The nouns are given one per line, TAB-separated, as follows: [...] lexsignatures) sorted in the order above.
Requires Python3 + tensorflow, keras, HuggingFace's transformers, conllu and PyTorch, all installable via pip3. Run the scripts without any argument for help.
The bin/decode folder contains the code used to predict the context signatures based on a trained model.
The bin/train folder contains the code used to train the classifiers using a list of seeds and a raw corpus.
The bin/wsd folder contains the code used for the WSD experiments using