{ "cells": [ { "cell_type": "markdown", "id": "chemical-amsterdam", "metadata": {}, "source": [ "TD 4 - Data analysis\n", "----------------------------\n", "\n", "In this notebook, we work with some basic statistical notions using Python libraries." ] }, { "cell_type": "code", "execution_count": 1, "id": "answering-statement", "metadata": {}, "outputs": [], "source": [ "import pandas\n", "import numpy as np\n", "import scipy, scipy.stats\n", "import random\n", "from matplotlib import pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "id": "noted-amber", "metadata": {}, "source": [ "### Compositionality\n", "\n", "The compositionality dataset below comes from the experiments in compositionality prediction described in [this paper](https://aclanthology.org/J19-1001/). We will focus on the column called _compositionality_, which contains, for each of 180 French compound nouns, the average of the ratings given on a scale from 0 to 5 by about 15-20 human judges. The details of the construction of this dataset can be found [here](https://aclanthology.org/P16-2026/). The dataset also contains many other columns that we may explore later, including automatic compositionality predictions.\n", "\n", "##### Reading the data\n", "\n", "We will read the full dataset from a tab-separated table file using Pandas, a very useful Python library for data analysis." ] }, { "cell_type": "code", "execution_count": 2, "id": "major-orbit", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | compound_lemma | compositionality |
|---|---|---|
| 134 | poule_mouillé | 0.0000 |
| 127 | pied_noir | 0.1333 |
| 19 | carte_blanc | 0.2000 |
| 151 | septième_ciel | 0.2143 |
| 15 | bouc_émissaire | 0.2308 |
| ... | ... | ... |
| 0 | activité_physique | 4.9333 |
| 55 | eau_potable | 5.0000 |
| 170 | téléphone_portable | 5.0000 |
| 96 | matière_gras | 5.0000 |
| 52 | eau_chaud | 5.0000 |
180 rows × 2 columns
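
A table like the one above can be produced by reading the tab-separated file and sorting on the _compositionality_ column. The sketch below is only illustrative: the name of the data file is not shown here, so it parses a small in-memory sample (a few rows copied from the table above) instead of the real file, and the use of `sep="\t"` and `sort_values` is an assumption about how the original cell was written.

```python
import io
import pandas

# Hypothetical stand-in for the real tab-separated file (its name is not shown here).
# The rows are copied from the table above.
sample_tsv = (
    "compound_lemma\tcompositionality\n"
    "activité_physique\t4.9333\n"
    "poule_mouillé\t0.0000\n"
    "eau_potable\t5.0000\n"
    "pied_noir\t0.1333\n"
)

# sep="\t" tells pandas that the columns are separated by tabs.
df = pandas.read_csv(io.StringIO(sample_tsv), sep="\t")

# Keep the two columns of interest and order the rows
# from least to most compositional.
by_score = df[["compound_lemma", "compositionality"]].sort_values("compositionality")
print(by_score)
```

With the real file, the `io.StringIO(...)` argument would simply be replaced by the path to the dataset.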

| | compositionality | freq.w1&w2 |
|---|---|---|
| 0 | 4.9333 | 13292 |
| 1 | 3.6000 | 19681 |
| 2 | 4.6000 | 14437 |
| 3 | 3.6364 | 540 |
| 4 | 1.3077 | 1259 |
| ... | ... | ... |
| 175 | 3.8000 | 10067 |
| 176 | 4.6923 | 5529 |
| 177 | 4.3571 | 1395 |
| 178 | 3.9231 | 19950 |
| 179 | 3.2000 | 859 |
180 rows × 2 columns
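
A two-column view like the one above can be obtained by selecting columns by name. The sketch below is a minimal illustration under assumptions: the DataFrame is rebuilt by hand from the first three rows shown above, and the variable names `df` and `pair` are hypothetical. Note that a column name such as `freq.w1&w2` contains characters that are not valid in a Python identifier, so it has to be accessed with bracket indexing rather than attribute access.

```python
import pandas

# Hand-built sample using the first three rows of the table above;
# the real DataFrame would come from the tab-separated dataset.
df = pandas.DataFrame(
    {
        "compositionality": [4.9333, 3.6000, 4.6000],
        "freq.w1&w2": [13292, 19681, 14437],
    }
)

# Passing a list of names selects several columns at once
# and returns a new DataFrame.
pair = df[["compositionality", "freq.w1&w2"]]

# Bracket indexing with a single name returns one column (a Series).
frequencies = pair["freq.w1&w2"]
print(pair)
```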
\n", "