{ "cells": [ { "cell_type": "markdown", "id": "chemical-amsterdam", "metadata": {}, "source": [ "TD 4 - Data analysis\n", "----------------------------\n", "\n", "In this notebook, we manipulate some basic statistical notions using Python libraries." ] }, { "cell_type": "code", "execution_count": 1, "id": "answering-statement", "metadata": {}, "outputs": [], "source": [ "import pandas\n", "import numpy as np\n", "import scipy, scipy.stats\n", "import random\n", "from matplotlib import pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "id": "noted-amber", "metadata": {}, "source": [ "### Compositionality\n", "\n", "The compositionality dataset below comes from the experiments in compositionality prediction described in [this paper](https://aclanthology.org/J19-1001/). We will focus on the column called _compositionality_, which contains the average of annotations given on a scale from 0 to 5 by about 15-20 human judges per compound noun, for a set of 180 compound nouns in French. The details of the construction of this dataset can be found [here](https://aclanthology.org/P16-2026/). The dataset also contains many other columns that we may explore later, including automatic compositionality predictions.\n", "\n", "##### Reading the data\n", "\n", "We will read the full dataset from a tab-separated table file using Pandas, a very useful Python library for data analysis." ] }, { "cell_type": "code", "execution_count": 2, "id": "major-orbit", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | compound_lemma | compositionality |
|---|---|---|
| 134 | poule_mouillé | 0.0000 |
| 127 | pied_noir | 0.1333 |
| 19 | carte_blanc | 0.2000 |
| 151 | septième_ciel | 0.2143 |
| 15 | bouc_émissaire | 0.2308 |
| ... | ... | ... |
| 0 | activité_physique | 4.9333 |
| 55 | eau_potable | 5.0000 |
| 170 | téléphone_portable | 5.0000 |
| 96 | matière_gras | 5.0000 |
| 52 | eau_chaud | 5.0000 |

180 rows × 2 columns

| | compositionality | freq.w1&w2 |
|---|---|---|
| 0 | 4.9333 | 13292 |
| 1 | 3.6000 | 19681 |
| 2 | 4.6000 | 14437 |
| 3 | 3.6364 | 540 |
| 4 | 1.3077 | 1259 |
| ... | ... | ... |
| 175 | 3.8000 | 10067 |
| 176 | 4.6923 | 5529 |
| 177 | 4.3571 | 1395 |
| 178 | 3.9231 | 19950 |
| 179 | 3.2000 | 859 |

180 rows × 2 columns
\n", "