Carlos Ramisch's personal webpage

SLICE: Supersense-based Lightweight Interpretable Contextual Embeddings

Cindy Aloui and Alexis Nasr and Lucie Barque and Carlos Ramisch

This package contains the datasets used in the experiments of the submitted paper "SLICE: Supersense-based Lightweight Interpretable Contextual Embeddings". SLICE is a hybrid model that combines supersense labels with contextual embeddings. We introduce a weakly supervised method to learn interpretable embeddings from raw corpora and small lists of seed words. Our model is able to represent both a word and its context as embeddings in the same compact space, whose dimensions correspond to interpretable supersenses.

The data and code can be downloaded here: slice-data-scripts-20201101.zip

This package contains:

The lists of seed nouns prototypical of each supersense
The evaluation corpus for the WSD experiments
The SLICE lexical signatures of the 10K most frequent nouns in the frWaC
The SLICE context signatures for the nouns in FrSemCor
The code used to create SLICE and to evaluate it in WSD

If you use SLICE, please cite the following paper:

@InProceedings{aloui-etAl-2020:coling,
  author = "Cindy Aloui and Alexis Nasr and Lucie Barque and Carlos Ramisch",
  title = "SLICE: Supersense-based Lightweight Interpretable Contextual Embeddings",
  booktitle = "28th International Conference on Computational Linguistics (COLING 2020)",
  year = "2020",
  publisher = "ICCL",  
}

Seed lists

Folder seeds: these are the lists of prototypical nouns used to pseudo-annotate the corpora. Seed nouns are given in a plain txt UTF-8 file, with one noun per line. For each supersense, we provide positive and negative seeds in *-all.txt, and corresponding dev/train splits.

Evaluation corpus for WSD

Folder frsemcor: contains the FrSemCor corpus divided into training, dev and test sets according to the Universal Dependencies split of the Sequoia treebank. The file is in Extended CoNLL-U format, with the last column containing the supersenses: FrSemCor
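
As an illustration of how the supersense column can be read, here is a minimal sketch in plain Python. It assumes the usual CoNLL-U conventions (comment lines starting with "#", a blank line between sentences, TAB-separated columns with the form and lemma in the second and third columns) and takes the supersense from the last column, as described above. The function name read_supersenses and the sample label are hypothetical; the actual label strings used in the corpus are not shown here.

```python
def read_supersenses(lines):
    """Yield each sentence as a list of (form, lemma, last_column) tuples.

    Assumption: standard CoNLL-U layout, with the supersense
    annotation stored in the last TAB-separated column.
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped.
            continue
        else:
            cols = line.split("\t")
            sentence.append((cols[1], cols[2], cols[-1]))
    if sentence:
        yield sentence


# Made-up two-token sentence; "HYPOTHETICAL_LABEL" is a placeholder.
sample = [
    "# sent_id = demo-1\n",
    "1\tLa\tle\tDET\t_\t_\t2\tdet\t_\t_\t*\n",
    "2\tgrue\tgrue\tNOUN\t_\t_\t0\troot\t_\t_\tHYPOTHETICAL_LABEL\n",
    "\n",
]
for sent in read_supersenses(sample):
    print(sent)
```

To read a real file, pass an open file object instead of the sample list, for example open(path, encoding="utf-8").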

Lexical and context signatures

File lexsignatures/10000-nouns.lexsig.txt contains the list of the 10K most frequent noun lemmas of the frWaC and their corresponding lexical signatures. Each lemma is provided on one line, followed by its signature vector. The TAB-separated supersense scores are sorted as follows: ANI, NAT, MAN, INF, DYN, STA. For instance, the word grue (crane) is represented as:

grue 0.0836 0.6519 0.9458 0.0654 0.0619 0.1135

with scores close to 1 for NAT (0.65, the bird sense) and MAN (0.95, the machine sense), and close to 0 for the other supersenses.
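
A line of this file can be turned into a labelled signature with a few lines of Python. The sketch below assumes exactly the layout described above (lemma, then six TAB-separated scores in the order ANI, NAT, MAN, INF, DYN, STA); the function name parse_lexsig is hypothetical and not part of the distributed code.

```python
SUPERSENSES = ("ANI", "NAT", "MAN", "INF", "DYN", "STA")


def parse_lexsig(line):
    """Split one lexical-signature line into (lemma, {supersense: score})."""
    fields = line.rstrip("\n").split("\t")
    lemma, scores = fields[0], [float(x) for x in fields[1:]]
    if len(scores) != len(SUPERSENSES):
        raise ValueError(f"expected {len(SUPERSENSES)} scores, got {len(scores)}")
    return lemma, dict(zip(SUPERSENSES, scores))


# The "grue" example from the text, with explicit TAB separators.
lemma, sig = parse_lexsig("grue\t0.0836\t0.6519\t0.9458\t0.0654\t0.0619\t0.1135")
# Rank supersenses from highest to lowest score and keep the top two.
top_two = sorted(sig, key=sig.get, reverse=True)[:2]
print(lemma, top_two)  # grue ['MAN', 'NAT']
```
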
Folder contextsignatures contains the list of nouns in FrSemCor train, dev and test parts, with the corresponding context and lexical signatures. The nouns are given one per line, TAB-separated, as follows:

sentence ID
noun lemma
reference class (1=ANI, 2=NAT, 3=MAN, 4=INF, 5=DYN, 6=STA)
6 contextual scores sorted in the order above
6 lexical scores (from lexsignatures) sorted in the order above
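
The fields listed above map naturally onto a small record type. The following sketch assumes each line has exactly 15 TAB-separated columns (3 metadata fields, 6 contextual scores, 6 lexical scores); the names ContextRow and parse_context_row, the sentence ID and the contextual scores in the example are made up for illustration.

```python
from typing import NamedTuple, Tuple

CLASSES = ("ANI", "NAT", "MAN", "INF", "DYN", "STA")


class ContextRow(NamedTuple):
    sentence_id: str
    lemma: str
    reference: str                 # class name decoded from the 1-6 index
    contextual: Tuple[float, ...]  # 6 contextual scores
    lexical: Tuple[float, ...]     # 6 lexical scores


def parse_context_row(line):
    """Parse one TAB-separated line of a context-signature file."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 3 + 2 * len(CLASSES):
        raise ValueError(f"expected 15 columns, got {len(cols)}")
    reference = CLASSES[int(cols[2]) - 1]  # classes are numbered from 1
    scores = [float(x) for x in cols[3:]]
    n = len(CLASSES)
    return ContextRow(cols[0], cols[1], reference,
                      tuple(scores[:n]), tuple(scores[n:]))


# Made-up row: only the lexical scores come from the "grue" example.
row = parse_context_row(
    "demo-1\tgrue\t3\t0.1\t0.2\t0.9\t0.1\t0.1\t0.1"
    "\t0.0836\t0.6519\t0.9458\t0.0654\t0.0619\t0.1135"
)
print(row.lemma, row.reference)  # grue MAN
```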

Code

Requires Python 3 with tensorflow, keras, HuggingFace's transformers, conllu and PyTorch, all installable via pip3. Run the scripts without any argument for help.

The bin/decode folder contains the code used to predict the context signatures based on a trained model.
The bin/train folder contains the code used to train the classifiers using a list of seeds and a raw corpus.
The bin/wsd folder contains the code used for the WSD experiments using (a) the MLP on SLICE, (b) FlauBERT embeddings that can be given to the MLP and (c) the most frequent supersense baseline.
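
To make the idea of a most frequent supersense baseline concrete, here is a minimal sketch. It is not the code from bin/wsd: it assumes one common reading of such a baseline, in which each noun lemma is assigned the supersense it receives most often in the training data, with the overall most frequent supersense as a fallback for unseen lemmas. The function names and the toy training pairs are hypothetical.

```python
from collections import Counter, defaultdict


def train_mfs(pairs):
    """Build a lemma -> most frequent supersense table from (lemma, label) pairs.

    Returns the table and a fallback label (the most frequent label overall).
    """
    per_lemma = defaultdict(Counter)
    overall = Counter()
    for lemma, label in pairs:
        per_lemma[lemma][label] += 1
        overall[label] += 1
    fallback = overall.most_common(1)[0][0]
    table = {lemma: counts.most_common(1)[0][0]
             for lemma, counts in per_lemma.items()}
    return table, fallback


def predict_mfs(table, fallback, lemma):
    """Predict the stored label for a known lemma, else the fallback."""
    return table.get(lemma, fallback)


# Toy training data, made up for illustration.
train = [("grue", "MAN"), ("grue", "MAN"), ("grue", "NAT"), ("chien", "ANI")]
table, fallback = train_mfs(train)
print(predict_mfs(table, fallback, "grue"))     # MAN
print(predict_mfs(table, fallback, "inconnu"))  # MAN (fallback)
```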