# Detecting the language of a text

**The goal of this project is to build a classifier that detects the language of a text from the frequencies of its character bigrams.**

## Data preparation

We download the training and test corpora.

In [1]:
%%bash
rm train.tgz
wget http://pageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/MASCO_Apprentissage_Automatique/train.tgz
tar xvfz train.tgz


train/
train/uk_iu-ud-train.txt
train/zh_gsd-ud-train.txt
train/la_ittb-ud-train.txt
train/af_afribooms-ud-train.txt
train/be_hse-ud-train.txt
train/cs_cac-ud-train.txt
train/fro_srcmf-ud-train.txt
train/hsb_ufal-ud-train.txt
train/mt_mudt-ud-train.txt
train/mr_ufal-ud-train.txt
train/en_lines-ud-train.txt
train/fr_sequoia-ud-train.txt
train/cu_proiel-ud-train.txt
train/fi_ftb-ud-train.txt
train/ro_rrt-ud-train.txt
train/sv_talbanken-ud-train.txt
train/hi_hdtb-ud-train.txt
train/en_esl-ud-train.txt
train/pl_sz-ud-train.txt
train/cs_pdt-ud-train.txt
train/bg_btb-ud-train.txt
train/el_gdt-ud-train.txt
train/fr_partut-ud-train.txt
train/sme_giella-ud-train.txt
train/da_ddt-ud-train.txt
train/qhe_hiencs-ud-train.txt
train/lv_lvtb-ud-train.txt
train/no_nynorsk-ud-train.txt
train/cs_fictree-ud-train.txt
train/en_ewt-ud-train.txt
train/ga_idt-ud-train.txt
train/fr_spoken-ud-train.txt
train/ca_ancora-ud-train.txt
train/sl_sst-ud-train.txt
train/ur_udtb-ud-train.txt
train/got_proiel-ud-train.tx

rm: cannot remove 'train.tgz': No such file or directory
--2021-01-31 20:32:31--  http://pageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/MASCO_Apprentissage_Automatique/train.tgz
Resolving pageperso.lif.univ-mrs.fr (pageperso.lif.univ-mrs.fr)... 139.124.22.27
Connecting to pageperso.lif.univ-mrs.fr (pageperso.lif.univ-mrs.fr)|139.124.22.27|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29281190 (28M) [application/x-gzip]
Saving to: ‘train.tgz’


**We list the files used to build the training data.**
**Each corpus is paired with a language identifier (en, fr, it, ...).**

In [2]:
l_corpus_train=[
['en', './train/en_partut-ud-train.txt'],
['fr', './train/fr_sequoia-ud-train.txt'],
['it', './train/it_partut-ud-train.txt'],
['nl', './train/nl_lassysmall-ud-train.txt'],
['sl', './train/sl_sst-ud-train.txt'],
['es', './train/es_ancora-ud-train.txt'],
['pt', './train/pt_bosque-ud-train.txt'],
['de', './train/de_gsd-ud-train.txt'],
['ca', './train/ca_ancora-ud-train.txt']
]

**We build a dictionary that maps each language to a numeric identifier.**

In [3]:
def calculeCodeLangues(l_corpus):
    # Assign consecutive integer codes to languages, in order of first appearance.
    nbLangues = 0
    codeLangue = {}

    for corpus in l_corpus:
        idLangue = corpus[0]
        if idLangue not in codeLangue:
            print('langue :', idLangue, 'code = ', nbLangues)
            codeLangue[idLangue] = nbLangues
            nbLangues += 1
    return codeLangue
  
codeLangues = calculeCodeLangues(l_corpus_train)
  

langue : en code =  0
langue : fr code =  1
langue : it code =  2
langue : nl code =  3
langue : sl code =  4
langue : es code =  5
langue : pt code =  6
langue : de code =  7
langue : ca code =  8


**We extract bigrams from the training corpora and compute their frequencies.**
**The bigram frequencies are stored in a file with the following format:**
**each line consists of a language identifier followed by the frequencies of the different bigrams in a fixed order (see the function bigram_code()).**

In [4]:
import random
import sys

def calculFreq(bigrammes):
    somme = 0
    for elt in bigrammes :
        somme += elt
    i = 0
    while i < len(bigrammes):
        bigrammes[i] /= somme
        i = i + 1

def afficheBigrammes(bigrammes, fic):
    for frequence in bigrammes :
        print(frequence, ' ', file = fic, end='')
    print('', file = fic)

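def char_code(c):
    # 28 character classes: 'a'..'z' -> 0..25, space -> 26, anything else -> 27
    if 'a' <= c <= 'z':
        return ord(c) - ord('a')
    elif c == ' ':
        return 26
    else:
        return 27


def bigram_code(c1, c2):
    # char_code() returns 28 distinct values, so the multiplier must be 28
    # for every pair to get its own index in 0..783 (with 27, pairs collide).
    return 28 * char_code(c1) + char_code(c2)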

def process_corpus_random(corpus, maxBigrammes, maxTirage):
    # Build maxTirage samples; each sample holds the relative frequencies of
    # maxBigrammes bigrams drawn at independent random positions in the corpus.
    try:
        fic = open(corpus, 'r')
    except IOError:
        print("le fichier", corpus, "n'existe pas")
        return None

    corpus_str = fic.read()
    longueur_corpus = len(corpus_str)
    fic.close()

    l_bigrammes = []
    tirage = 0
    while tirage < maxTirage:
        bigrammes = [0] * 784  # 28 * 28 possible bigram codes
        nbBigrammes = 0
        # '<' (not '<=') so that exactly maxBigrammes bigrams are drawn
        while nbBigrammes < maxBigrammes:
            # randint is inclusive on both ends; position + 1 must stay a valid index
            position = random.randint(0, longueur_corpus - 2)
            code_bigramme = bigram_code(corpus_str[position], corpus_str[position + 1])
            bigrammes[code_bigramme] += 1
            nbBigrammes += 1
        calculFreq(bigrammes)
        l_bigrammes.append(bigrammes)
        tirage += 1
    return l_bigrammes

def process_corpus_sequential(corpus, maxBigrammes, maxTirage):
    # Build maxTirage samples; each sample holds the relative frequencies of
    # maxBigrammes consecutive bigrams starting at a random position.
    try:
        fic = open(corpus, 'r')
    except IOError:
        print("le fichier", corpus, "n'existe pas")
        return None

    corpus_str = fic.read()
    longueur_corpus = len(corpus_str)
    fic.close()

    l_bigrammes = []
    tirage = 0
    while tirage < maxTirage:
        bigrammes = [0] * 784  # 28 * 28 possible bigram codes
        nbBigrammes = 0
        # Leave enough room after the start position for the whole window
        position = random.randint(0, longueur_corpus - maxBigrammes - 2)
        # '<' (not '<=') so that exactly maxBigrammes bigrams are counted
        while nbBigrammes < maxBigrammes:
            code_bigramme = bigram_code(corpus_str[position], corpus_str[position + 1])
            bigrammes[code_bigramme] += 1
            nbBigrammes += 1
            position += 1

        calculFreq(bigrammes)
        l_bigrammes.append(bigrammes)
        tirage += 1
    return l_bigrammes

def extract_bigrams(l_corpus, mode, nbBigrammes, nbTirages, fichierSortie):
    # Write one line per sample: the language identifier, then its 784 bigram frequencies.
    try:
        ficOut = open(fichierSortie, 'w')
    except IOError:
        print("impossible d'ouvrir le fichier", fichierSortie, "en écriture")
        return  # a bare 'exit' does nothing here; leave the function instead

    for corpus in l_corpus:
        idLangue = corpus[0]
        fichierCorpus = corpus[1]
        print('traite corpus', fichierCorpus)
        if mode == 'random':
            l_bigrammes = process_corpus_random(fichierCorpus, nbBigrammes, nbTirages)
        else:
            l_bigrammes = process_corpus_sequential(fichierCorpus, nbBigrammes, nbTirages)
        if l_bigrammes is None:
            continue  # the corpus file could not be read
        for bigrammes in l_bigrammes:
            print(idLangue, ' ', file = ficOut, end='')
            afficheBigrammes(bigrammes, ficOut)
    ficOut.close()
    
extract_bigrams(l_corpus_train, 'random', 100, 100, 'train.dat')

traite corpus ./train/en_partut-ud-train.txt
traite corpus ./train/fr_sequoia-ud-train.txt
traite corpus ./train/it_partut-ud-train.txt
traite corpus ./train/nl_lassysmall-ud-train.txt
traite corpus ./train/sl_sst-ud-train.txt
traite corpus ./train/es_ancora-ud-train.txt
traite corpus ./train/pt_bosque-ud-train.txt
traite corpus ./train/de_gsd-ud-train.txt
traite corpus ./train/ca_ancora-ud-train.txt
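
To make the encoding concrete, here is a minimal sketch that computes the 784-dimensional frequency vector of a short string. It is an illustration under stated assumptions, not the sampling procedure used above: it counts every consecutive pair of characters instead of drawing positions at random, it uses 28 (the number of character classes) as the multiplier so that each pair gets a distinct index, and the helper name `bigram_frequencies` and the sample sentence are hypothetical.

```python
ALPHABET_SIZE = 28  # 26 lowercase letters + space + "anything else"

def char_code(c):
    """Map a character to one of 28 classes."""
    if 'a' <= c <= 'z':
        return ord(c) - ord('a')
    if c == ' ':
        return 26
    return 27

def bigram_code(c1, c2):
    """Index of the pair (c1, c2) in a flat vector of 28 * 28 entries."""
    return ALPHABET_SIZE * char_code(c1) + char_code(c2)

def bigram_frequencies(text):
    """Relative frequency of each bigram code over all consecutive pairs (hypothetical helper)."""
    counts = [0] * (ALPHABET_SIZE ** 2)
    for c1, c2 in zip(text, text[1:]):
        counts[bigram_code(c1, c2)] += 1
    total = sum(counts)
    return [n / total for n in counts] if total else counts

freqs = bigram_frequencies("le chat et le chien")  # toy input, not from the corpora
print(len(freqs))  # 784
print(freqs[bigram_code('l', 'e')])  # share of the bigram "le" in the sentence
```

One line of the data file is then the language identifier followed by these 784 values separated by whitespace.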


**We format the data so that it can be fed to the neural network for training.**

In [5]:
import random

import numpy as np

def lectureDonnees(nomFichier, codeLangue):
    # Read the samples, one-hot encode the language labels and shuffle the examples.
    try:
        fic = open(nomFichier, 'r')
    except IOError:
        print("le fichier", nomFichier, "n'existe pas")
        return None

    nbLangues = len(codeLangue)
    lx = []       # feature vectors (bigram frequencies)
    langues = []  # numeric language codes
    for ligne in fic:
        liste = ligne.split()
        if not liste:
            continue  # skip empty lines
        langue = liste.pop(0)
        langues.append(codeLangue[langue])
        lx.append([float(x) for x in liste])
    fic.close()

    # One-hot encoding of the labels
    ly = []
    for code in langues:
        v = [0] * nbLangues
        v[code] = 1
        ly.append(v)

    # Shuffle features and labels together so that each pair stays aligned.
    # The file is grouped by language, and Keras takes its validation split
    # from the end of the data, so the examples must be mixed beforehand.
    t = list(zip(lx, ly))
    random.shuffle(t)
    lx, ly = list(zip(*t))

    x_train = np.array(lx, dtype="float")
    y_train = np.array(ly, dtype="int")
    print(len(x_train), "exemples lus")
    return (x_train, y_train)



(x_train, y_train) = lectureDonnees('train.dat', codeLangues)


900 exemples lus
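
As a compact illustration of this formatting step, the sketch below parses two toy lines in the file format described earlier, one-hot encodes the labels with NumPy, and shuffles features and labels with a single permutation. The dictionary `code_langues`, the four-value feature vectors, and the seed are hypothetical stand-ins chosen to keep the example short; the real vectors have 784 values.

```python
import numpy as np

code_langues = {'en': 0, 'fr': 1, 'it': 2}  # hypothetical stand-in for codeLangues
lines = [
    "fr 0.5 0.25 0.25 0.0",  # toy lines: identifier, then the frequencies
    "en 0.0 0.0 1.0 0.0",
]

labels, features = [], []
for line in lines:
    fields = line.split()
    labels.append(code_langues[fields[0]])
    features.append([float(v) for v in fields[1:]])

x = np.array(features, dtype=float)
# Row i of the identity matrix is the one-hot vector of class i
y = np.eye(len(code_langues), dtype=int)[labels]

# Apply the same permutation to both arrays so that pairs stay aligned
rng = np.random.default_rng(0)
perm = rng.permutation(len(x))
x_shuf, y_shuf = x[perm], y[perm]

print(x.shape, y.shape)  # (2, 4) (2, 3)
```

Indexing both arrays with one permutation plays the same role as zipping, shuffling, and unzipping the two lists.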


**We build the network architecture and train it.**

In [6]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

model = Sequential()
nbLangues = len(codeLangues.keys()) 

print('nbLangues =', nbLangues)

model.add(Dense(units=100, activation='tanh', input_dim=28*28))
#model.add(Dropout(0.5))
model.add(Dense(units=nbLangues, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# x_train and y_train are Numpy arrays --just like in the Scikit-Learn API.
model.fit(x_train, y_train, epochs=20, batch_size=16, validation_split=0.2)



nbLangues = 9
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f30c3e899b0>
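
To show what this architecture computes, here is a minimal NumPy sketch of its forward pass: a dense layer of 100 units with tanh activation, followed by a dense softmax layer with one output per language. The weights are random placeholders rather than trained values, and the names `softmax`, `forward`, `w1`, `b1`, `w2`, `b2` are assumptions made for this illustration; it is not how Keras implements the layers internally.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(x, w1, b1, w2, b2):
    """Hidden tanh layer followed by a softmax output layer."""
    hidden = np.tanh(x @ w1 + b1)
    return softmax(hidden @ w2 + b2)

rng = np.random.default_rng(0)
n_features, n_hidden, n_langs = 28 * 28, 100, 9

# Random placeholder parameters (a trained model would supply these)
w1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.1, size=(n_hidden, n_langs))
b2 = np.zeros(n_langs)

# Four fake frequency vectors, each normalised to sum to 1
x = rng.random((4, n_features))
x /= x.sum(axis=1, keepdims=True)

probs = forward(x, w1, b1, w2, b2)
n_params = w1.size + b1.size + w2.size + b2.size
print(probs.shape)  # (4, 9)
print(n_params)     # 79409
```

Each output row is a probability distribution over the nine languages, which is why the model is trained with categorical cross-entropy against one-hot labels.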

**We load the test data**

In [7]:
%%bash
rm test.tgz
wget http://pageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/MASCO_Apprentissage_Automatique/test.tgz
tar xvfz test.tgz


test/
test/cs_fictree-ud-test.txt
test/akk_pisandub-ud-test.txt
test/la_ittb-ud-test.txt
test/gl_ctg-ud-test.txt
test/bg_btb-ud-test.txt
test/sv_talbanken-ud-test.txt
test/hi_hdtb-ud-test.txt
test/fi_pud-ud-test.txt
test/et_edt-ud-test.txt
test/ga_idt-ud-test.txt
test/da_ddt-ud-test.txt
test/br_keb-ud-test.txt
test/ja_pud-ud-test.txt
test/cs_pdt-ud-test.txt
test/pl_lfg-ud-test.txt
test/no_nynorsk-ud-test.txt
test/pt_bosque-ud-test.txt
test/id_pud-ud-test.txt
test/bm_crb-ud-test.txt
test/kpv_ikdp-ud-test.txt
test/ar_padt-ud-test.txt
test/kk_ktb-ud-test.txt
test/ar_nyuad-ud-test.txt
test/hy_armtdp-ud-test.txt
test/fi_ftb-ud-test.txt
test/it_postwita-ud-test.txt
test/fa_seraji-ud-test.txt
test/en_partut-ud-test.txt
test/zh_pud-ud-test.txt
test/cs_pud-ud-test.txt
test/sk_snk-ud-test.txt
test/tr_imst-ud-test.txt
test/eu_bdt-ud-test.txt
test/got_proiel-ud-test.txt
test/ug_udt-ud-test.txt
test/kpv_lattice-ud-test.txt
test/ru_taiga-ud-test.txt
test/ru_syntagrus-ud-test.txt
test/wbp_ufal-ud-tes

rm: cannot remove 'test.tgz': No such file or directory
--2021-01-31 20:32:43--  http://pageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/MASCO_Apprentissage_Automatique/test.tgz
Resolving pageperso.lif.univ-mrs.fr (pageperso.lif.univ-mrs.fr)... 139.124.22.27
Connecting to pageperso.lif.univ-mrs.fr (pageperso.lif.univ-mrs.fr)|139.124.22.27|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5342409 (5.1M) [application/x-gzip]
Saving to: ‘test.tgz’


In [8]:
l_corpus_test=[
['en', './test/en_partut-ud-test.txt'],
['fr', './test/fr_sequoia-ud-test.txt'],
['it', './test/it_partut-ud-test.txt'],
['nl', './test/nl_lassysmall-ud-test.txt'],
['sl', './test/sl_sst-ud-test.txt'],
['es', './test/es_ancora-ud-test.txt'],
['pt', './test/pt_bosque-ud-test.txt'],
['de', './test/de_gsd-ud-test.txt'],
['ca', './test/ca_ancora-ud-test.txt']
]

In [9]:
extract_bigrams(l_corpus_test, 'random', 500, 10, 'test.dat')
(x_test, y_test) = lectureDonnees('test.dat', codeLangues)

traite corpus ./test/en_partut-ud-test.txt
traite corpus ./test/fr_sequoia-ud-test.txt
traite corpus ./test/it_partut-ud-test.txt
traite corpus ./test/nl_lassysmall-ud-test.txt
traite corpus ./test/sl_sst-ud-test.txt
traite corpus ./test/es_ancora-ud-test.txt
traite corpus ./test/pt_bosque-ud-test.txt
traite corpus ./test/de_gsd-ud-test.txt
traite corpus ./test/ca_ancora-ud-test.txt
90 exemples lus


**We evaluate the model on the test data**

In [10]:
l_id = [''] * len(codeLangues)
for (key, val) in (codeLangues.items()):
  #print("key =", key , 'val =', val)
  l_id[int(val)] = key
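
The same inverse mapping, from numeric code back to language identifier, can be sketched with a dictionary comprehension. The small dictionary `code_langues` below is a hypothetical stand-in for the one built earlier, and the sketch assumes the codes are the consecutive integers 0, 1, 2, ...

```python
# Hypothetical stand-in for the language -> code dictionary
code_langues = {'en': 0, 'fr': 1, 'it': 2}

# Invert the mapping, then lay the identifiers out in code order
id_by_code = {code: lang for lang, code in code_langues.items()}
l_id = [id_by_code[i] for i in range(len(code_langues))]

print(l_id)  # ['en', 'fr', 'it']
```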

In [11]:

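def argmax(l):
  # Index of the largest element (the first one in case of ties).
  # The running maximum is named 'best' to avoid shadowing the built-in max().
  best = l[0]
  arg = 0
  for i in range(1, len(l)):
    if l[i] > best:
      best = l[i]
      arg = i
  return arg

# evaluate() returns [loss, accuracy] because the model was compiled with metrics=['accuracy']
score = model.evaluate(x_test, y_test)
print('test.dat score = ', score)
l_pred = model.predict(x_test, verbose=1)
for i in range(len(l_pred)):
  # Compare the most probable language with the one-hot gold label
  predicted = l_id[argmax(l_pred[i])]
  gold = l_id[argmax(y_test[i])]
  print('pred =', predicted, 'gold =', gold)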


test.dat score =  [0.1935911774635315, 1.0]
pred = fr gold = fr
pred = ca gold = ca
pred = ca gold = ca
pred = de gold = de
pred = sl gold = sl
pred = en gold = en
pred = fr gold = fr
pred = fr gold = fr
pred = it gold = it
pred = nl gold = nl
pred = es gold = es
pred = nl gold = nl
pred = en gold = en
pred = es gold = es
pred = es gold = es
pred = ca gold = ca
pred = ca gold = ca
pred = en gold = en
pred = it gold = it
pred = it gold = it
pred = it gold = it
pred = sl gold = sl
pred = de gold = de
pred = de gold = de
pred = fr gold = fr
pred = de gold = de
pred = pt gold = pt
pred = it gold = it
pred = pt gold = pt
pred = ca gold = ca
pred = sl gold = sl
pred = pt gold = pt
pred = sl gold = sl
pred = es gold = es
pred = es gold = es
pred = it gold = it
pred = en gold = en
pred = es gold = es
pred = it gold = it
pred = pt gold = pt
pred = nl gold = nl
pred = en gold = en
pred = nl gold = nl
pred = pt gold = pt
pred = pt gold = pt
pred = pt gold = pt
pred = nl gold = nl
pred = de gold =
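
Rather than reading the pairs one by one, the predictions can be summarised with an accuracy figure and a confusion matrix. The sketch below assumes toy arrays `probs` and `gold_onehot` in place of the model output and of `y_test`, and a three-language `l_id`; all three are hypothetical values chosen for illustration, not results from the run above.

```python
import numpy as np

l_id = ['en', 'fr', 'it']  # hypothetical code -> identifier list

# Toy predicted probabilities and one-hot gold labels (not real results)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.2, 0.6]])
gold_onehot = np.array([[1, 0, 0],
                        [0, 1, 0],
                        [0, 0, 1],
                        [0, 0, 1]])

pred = probs.argmax(axis=1)
gold = gold_onehot.argmax(axis=1)
accuracy = float((pred == gold).mean())

# confusion[g, p] counts examples of gold language g predicted as language p
confusion = np.zeros((len(l_id), len(l_id)), dtype=int)
np.add.at(confusion, (gold, pred), 1)

print('accuracy =', accuracy)  # accuracy = 0.75
for g, row in zip(l_id, confusion):
    print(g, row.tolist())
```

Off-diagonal entries show which languages are confused with each other, which a single accuracy value hides.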