Fouille de textes / Text Mining

Extraction d’information / Information Extraction
Nos travaux ont d’abord concerné l’usage de techniques d’apprentissage supervisé statistique en extraction d’information automatique. Nous avons tout d’abord utilisé l’algorithme BWI (Boosted Wrapper Induction) pour réaliser de l’extraction d’information à partir de pages Web collectées (système Agathe 2, 2012). Puis nous nous sommes intéressés à l’usage d’ontologies dans la classification automatique de textes, notamment à l’intérêt de la conceptualisation en utilisant la méthode Rocchio, avec la thèse de S. Albitar soutenue en 2013.
Ensuite nos travaux ont porté sur l’utilisation combinée d’ontologies et d’apprentissage supervisé symbolique (ou relationnel), par programmation logique inductive (PLI) pour l’extraction d’entités nommées et surtout de relations (binaires) entre entités nommées. Cela a conduit au développement du système OntoILPER dans le cadre de la thèse de R. Lima soutenue en 2014. Nous avons poursuivi ces travaux avec l’utilisation de « triple-store » dans sa mise en œuvre (Mémoire de Master Recherche de C.C. Ngo et D. Magdy), et l’usage de méthodes d’ensemble pour améliorer le processus d’apprentissage du système OntoILPER.
Enfin, nous nous sommes intéressés à l’extraction de relations par apprentissage profond, notamment en utilisant une représentation vectorielle enrichie par la prise en compte des dépendances syntaxiques au niveau du Word Embedding (Mastères de R. Azcurra et de A. Merad). Une application de cette approche à l’étiquetage de rôles spatiaux liés à des trajectoires de randonnées est l’objet de la thèse de A. Moussa. Toujours en utilisant l’apprentissage profond, citons aussi la thèse de M. Mallek sur l’extraction et la classification de relations selon le contexte dans des documents textuels non structurés, et la thèse CIFRE de Y. Duperis avec la société Mooben&Roster sur le développement d’un système d’aide à la constitution de consortiums d’entreprises compétents pour un appel d’offres, avec une approche basée sur le traitement des langues et les ontologies.

Our work first addressed the use of statistical supervised learning techniques in automatic information extraction. We began by using the BWI (Boosted Wrapper Induction) algorithm to extract information from collected web pages (Agathe 2 system, 2012). We then studied the use of ontologies in automatic text classification, in particular the benefit of conceptualisation with the Rocchio method, in the thesis of S. Albitar defended in 2013.
Our work then focused on the combined use of ontologies and symbolic (or relational) supervised learning, by inductive logic programming (ILP), for the extraction of named entities and especially of (binary) relations between named entities. This led to the development of the OntoILPER system as part of R. Lima's thesis, defended in 2014. We continued this work by using a triple store in its implementation (Research Master's theses of C.C. Ngo and D. Magdy) and by using ensemble methods to improve the learning process of the OntoILPER system.
Finally, we turned to relation extraction by deep learning, in particular using a vector representation enriched with syntactic dependencies at the word embedding level (Master's theses of R. Azcurra and A. Merad). An application of this approach to the labelling of spatial roles related to hiking trajectories is the subject of A. Moussa's thesis. Also using deep learning, we mention the thesis of M. Mallek on context-dependent extraction and classification of relations in unstructured textual documents, and the CIFRE thesis of Y. Duperis (with the company Mooben&Roster) on the development of a system that helps assemble consortia of competent companies for a call for tenders, with an approach based on language processing and ontologies.

Résumé automatique / Automatic Summarization
La thèse de S. Lamsiyah concerne le résumé extractif multi-documents non supervisé à base d’apprentissage profond. Concernant le « Generic Multi-Document Summarization (G-MDS) », elle propose tout d’abord une approche à base de centroïde avec différentes représentations vectorielles de phrases (sentence embeddings), puis exploite l’apprentissage par transfert (transfer learning) à partir du réglage fin de BERT (Bidirectional Encoder Representations from Transformers) sur des tâches de compréhension du langage naturel pour l’apprentissage de la représentation des phrases. Concernant le Query-Focused Multi-Document Summarization (QF-MDS), nous proposons une méthode extractive non supervisée basée sur l’apprentissage par transfert à partir de modèles de plongement de phrases (sentence embeddings) pré-entraînés, combinés avec le modèle BM25 et le critère de pertinence marginale maximale (maximal marginal relevance criterion). Citons aussi le travail sur la recherche de cohérence dans les résumés extractifs, notamment avec le travail du master brésilien de R. Garcia, sur un résumé extractif cohérent mono-document à base de programmation en nombres entiers, et plus récemment, avec les travaux du master brésilien de P. Assis, sur le résumé cohésif extractif mono-document basé sur la représentation sémantique AMR (Abstract Meaning Representation) du texte à résumer.

The thesis of S. Lamsiyah concerns unsupervised extractive multi-document summarization based on deep learning. For Generic Multi-Document Summarization (G-MDS), it first proposes a centroid-based approach with different sentence embedding representations, and then exploits transfer learning by fine-tuning BERT (Bidirectional Encoder Representations from Transformers) on natural language understanding tasks to learn sentence representations. For Query-Focused Multi-Document Summarization (QF-MDS), we propose an unsupervised extractive method based on transfer learning from pre-trained sentence embedding models, combined with the BM25 model and the maximal marginal relevance criterion. We also mention work on coherence in extractive summaries: the Brazilian master's thesis of R. Garcia, on coherent single-document extractive summarization based on integer programming, and more recently the Brazilian master's thesis of P. Assis, on cohesive single-document extractive summarization based on the AMR (Abstract Meaning Representation) semantic representation of the text to be summarized.
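For readers unfamiliar with the maximal marginal relevance (MMR) criterion mentioned above, the following is a minimal, generic sketch of greedy MMR sentence selection. It is not the implementation used in the thesis: the function names `cosine` and `mmr_select`, the trade-off parameter `lam`, and the choice of cosine similarity over plain Python lists are assumptions made for illustration. In practice the vectors would come from a sentence embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mmr_select(sentence_vecs, query_vec, k, lam=0.7):
    """Greedily pick up to k sentence indices by maximal marginal relevance.

    Each step picks the candidate maximising
        lam * sim(sentence, query) - (1 - lam) * max sim(sentence, already selected)
    so lam=1.0 ranks by relevance only and lower values penalise redundancy.
    """
    relevance = [cosine(v, query_vec) for v in sentence_vecs]
    candidates = list(range(len(sentence_vecs)))
    selected = []

    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to any sentence already chosen.
            redundancy = max(
                (cosine(sentence_vecs[i], sentence_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    return selected
```

Calling `mmr_select(vectors, query, k)` returns the indices of the selected sentences in selection order. With `lam` close to 1 the selection follows relevance to the query alone, while lower values favour sentences that differ from those already in the summary.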

Simplification automatique de texte / Text Simplification
Enfin, depuis peu et en collaboration avec des linguistes, nous nous intéressons à la simplification automatique de texte, notamment avec la thèse de R. Hijazi sur la simplification syntaxique de textes s’appuyant sur la représentation sémantique à base de graphe DMRS (Dependency Minimal Recursion Semantics) et la réécriture de graphes.

Finally, we have recently begun working, in collaboration with linguists, on automatic text simplification, in particular with the thesis of R. Hijazi on syntactic simplification of texts based on the graph-based semantic representation DMRS (Dependency Minimal Recursion Semantics) and on graph rewriting.