{"id":505,"date":"2021-12-10T12:01:15","date_gmt":"2021-12-10T11:01:15","guid":{"rendered":"https:\/\/pageperso.lis-lab.fr\/bernard.espinasse\/nouveausite\/?page_id=505"},"modified":"2022-01-09T23:49:38","modified_gmt":"2022-01-09T22:49:38","slug":"fouille-de-textes-text-mining","status":"publish","type":"page","link":"https:\/\/pageperso.lis-lab.fr\/bernard.espinasse\/index.php\/fouille-de-textes-text-mining\/","title":{"rendered":"Fouille de textes \/ Text Mining"},"content":{"rendered":"<p><span style=\"font-size: 18pt;\"><strong>Extraction d&rsquo;information \/ <em>Information Extraction<\/em><\/strong><\/span><br \/>\n<span style=\"font-size: 14pt;\">Nos travaux ont d&rsquo;abord concern\u00e9s l\u2019usage de techniques d\u2019apprentissage supervis\u00e9es statistiques en extraction d&rsquo;information automatique. Nous avons tout d&rsquo;abord utilis\u00e9 l&rsquo;algorithme BWI (Boosted Wrapper Induction) pour r\u00e9aliser de l&rsquo;extraction d\u2019information \u00e0 partir de pages Web collect\u00e9es (Syst\u00e8me Agathe 2 &#8211; 2012). Puis nous nous sommes int\u00e9ress\u00e9 \u00e0 l&rsquo;usage d&rsquo;ontologies dans la classification automatique de textes, notamment l&rsquo;int\u00e9r\u00eat de la conceptualisation en utilisant la m\u00e9thode Rocchio avec la th\u00e8se de <strong>S. Albitar<\/strong> soutenue en 2013.<\/span><br \/>\n<span style=\"font-size: 14pt;\">Ensuite nos travaux ont port\u00e9 sur l&rsquo;utilisation combin\u00e9e d&rsquo;ontologies et d&rsquo;apprentissage supervis\u00e9 symbolique (ou relationnel), par programmation logique inductive (PLI) pour l\u2019extraction d\u2019entit\u00e9s nomm\u00e9es et surtout de relations (binaires) entre entit\u00e9s nomm\u00e9es. Cela a conduit au d\u00e9veloppement du syst\u00e8me OntoILPER dans le cadre de la th\u00e8se de <strong>R. Lima<\/strong> soutenue en 2014. 
Nous avons poursuivi ces travaux avec l&rsquo;utilisation de \u00ab triple-store \u00bb dans sa mise en \u0153uvre (M\u00e9moire de Master Recherche de <strong>C.C. Ngo<\/strong> et <strong>D. Magdy<\/strong>), et l&rsquo;usage de m\u00e9thodes d\u2019ensemble pour am\u00e9liorer le processus d\u2019apprentissage du syst\u00e8me OntoILPER.<\/span><br \/>\n<span style=\"font-size: 14pt;\">Ensuite nous nous sommes int\u00e9ress\u00e9s \u00e0 l\u2019extraction de relations par apprentissage profond, notamment en utilisant une repr\u00e9sentation vectorielle enrichie par la prise en compte des d\u00e9pendances syntaxiques au niveau du Word Embedding (Mast\u00e8res de <strong>R. Azcurra<\/strong> et de <strong>A. Merad<\/strong>). Une application de cette approche \u00e0 l\u2019\u00e9tiquetage de r\u00f4les spatiaux li\u00e9s \u00e0 des trajectoires de randonn\u00e9es est l\u2019objet de la th\u00e8se de <strong>A. Moussa<\/strong>. Toujours en utilisant l&rsquo;apprentissage profond, citons aussi la th\u00e8se de <strong>M. Mallek<\/strong> sur l&rsquo;extraction et la classification de relations selon le contexte dans des documents textuels non structur\u00e9s, et la th\u00e8se CIFRE de <strong>Y. Duperis<\/strong>, avec la soci\u00e9t\u00e9 Mooben&amp;Roster, sur le d\u00e9veloppement d&rsquo;un syst\u00e8me d\u2019aide \u00e0 la constitution de consortiums d\u2019entreprises comp\u00e9tents pour un appel d\u2019offres avec une approche bas\u00e9e sur le traitement des langues et les ontologies.<\/span><\/p>\n<p><span style=\"font-size: 14pt;\"><em>Our work first concerned the use of statistical supervised learning techniques in automatic information extraction. We first used the BWI (Boosted Wrapper Induction) algorithm to perform information extraction from collected web pages (Agathe 2 system &#8211; 2012). 
Then we were interested in the use of ontologies in automatic text classification, in particular the benefit of conceptualisation using the Rocchio method, with the thesis of <strong>S. Albitar<\/strong> defended in 2013.<\/em><\/span><br \/>\n<span style=\"font-size: 14pt;\"><em>Then our work focused on the combined use of ontologies and symbolic (or relational) supervised learning, by inductive logic programming (ILP), for the extraction of named entities and especially of (binary) relations between named entities. This led to the development of the OntoILPER system in the framework of the thesis of <strong>R. Lima<\/strong>, defended in 2014. We continued this work with the use of a triple store in its implementation (Research Master theses of <strong>C.C. Ngo<\/strong> and <strong>D. Magdy<\/strong>), and the use of ensemble methods to improve the learning process of the OntoILPER system.<\/em><\/span><br \/>\n<span style=\"font-size: 14pt;\"><em>Next we were interested in the extraction of relations by deep learning, in particular by using a vector representation enriched by taking into account syntactic dependencies at the level of word embeddings (Master&rsquo;s theses of <strong>R. Azcurra<\/strong> and <strong>A. Merad<\/strong>). An application of this approach to the labelling of spatial roles related to hiking trajectories is the subject of the thesis of <strong>A. Moussa<\/strong>. Still using deep learning, let us also mention the thesis of <strong>M. Mallek<\/strong> on the extraction and classification of relations according to context in unstructured textual documents, and the CIFRE thesis of <strong>Y. 
Duperis<\/strong> (with the Mooben&amp;Roster company) on the development of a system that assists in building consortia of companies competent for a call for tenders, with an approach based on natural language processing and ontologies.<\/em><\/span><\/p>\n<p><span style=\"font-size: 18pt;\"><strong>R\u00e9sum\u00e9 automatique \/ <em>Automatic Summarization<\/em><\/strong><\/span><br \/>\n<span style=\"font-size: 14pt;\">La th\u00e8se de <strong>S. Lamsiyah<\/strong> concerne le r\u00e9sum\u00e9 extractif multi-documents non supervis\u00e9 \u00e0 base d&rsquo;apprentissage profond, tout d&rsquo;abord concernant le \u00ab\u00a0Generic Multi-Document Summarization (G-MDS)\u00a0\u00bb avec une approche centroid et diff\u00e9rentes \u00ab\u00a0sentence embedding representations\u00a0\u00bb, et ensuite en exploitant l&rsquo;apprentissage par transfert (Transfer learning) \u00e0 partir du r\u00e9glage fin de BERT (Bidirectional Encoder Representations from Transformers) sur des t\u00e2ches de compr\u00e9hension du langage naturel pour l&rsquo;apprentissage de la repr\u00e9sentation des phrases. Concernant le Query-Focused Multi-Document Summarization (QF-MDS), nous proposons une m\u00e9thode extractive non supervis\u00e9e bas\u00e9e sur l&rsquo;apprentissage par transfert \u00e0 partir de mod\u00e8les de plongement de phrases pr\u00e9-entra\u00een\u00e9s (mod\u00e8le BM25) combin\u00e9s avec le crit\u00e8re de pertinence marginale maximale (maximal marginal relevance criterion). Citons aussi les travaux sur la recherche de coh\u00e9rence dans les r\u00e9sum\u00e9s extractifs, notamment avec le travail du master br\u00e9silien de <strong>R. Garcia<\/strong>, sur un r\u00e9sum\u00e9 extractif coh\u00e9rent mono-document \u00e0 base de programmation en nombres entiers, et plus r\u00e9cemment, avec les travaux du master br\u00e9silien de <strong>P. 
Assis<\/strong>, sur le r\u00e9sum\u00e9 extractif coh\u00e9sif mono-document bas\u00e9 sur la repr\u00e9sentation s\u00e9mantique AMR (Abstract Meaning Representation) du texte \u00e0 r\u00e9sumer.<\/span><\/p>\n<p><span style=\"font-size: 14pt;\"><em>The thesis of <strong>S. Lamsiyah<\/strong> concerns unsupervised extractive multi-document summarization based on deep learning, firstly in Generic Multi-Document Summarization (G-MDS) with a centroid approach and different sentence embedding representations, and secondly by exploiting transfer learning from fine-tuning BERT (Bidirectional Encoder Representations from Transformers) on natural language understanding tasks for sentence representation learning. Concerning Query-Focused Multi-Document Summarization (QF-MDS), we propose an unsupervised extractive method based on transfer learning from pre-trained sentence embedding models (BM25 model) combined with the maximal marginal relevance criterion. Let us also mention work on coherence in extractive summarization, in particular the Brazilian Master&rsquo;s thesis of <strong>R. Garcia<\/strong> on coherent single-document extractive summarization based on integer programming, and more recently, the Brazilian Master&rsquo;s thesis of <strong>P. Assis<\/strong> on cohesive single-document extractive summarization based on the AMR (Abstract Meaning Representation) semantic representation of the text to be summarized.<\/em><\/span><\/p>\n<p><span style=\"font-size: 18pt;\"><strong>Simplification automatique de texte \/ <em>Text Simplification<\/em><\/strong><\/span><br \/>\n<span style=\"font-size: 14pt;\">Enfin, depuis peu et en collaboration avec des linguistes, nous nous int\u00e9ressons \u00e0 la simplification automatique de texte, notamment avec la th\u00e8se de <strong>R. 
Hijazi<\/strong> sur la simplification syntaxique de textes s&rsquo;appuyant sur la repr\u00e9sentation s\u00e9mantique \u00e0 base de graphe DMRS (Dependency Minimal Recursion Semantics) et la r\u00e9\u00e9criture de graphes.<\/span><\/p>\n<p><span style=\"font-size: 14pt;\"><em>Finally, more recently and in collaboration with linguists, we have been interested in automatic text simplification, in particular with the thesis of <strong>R. Hijazi<\/strong> on syntactic simplification of texts based on the DMRS (Dependency Minimal Recursion Semantics) graph-based semantic representation and on graph rewriting.<\/em><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Extraction d&rsquo;information \/ Information Extraction Nos travaux ont d&rsquo;abord concern\u00e9 l\u2019usage de techniques d\u2019apprentissage supervis\u00e9 statistique en extraction d&rsquo;information automatique. Nous avons tout d&rsquo;abord utilis\u00e9 l&rsquo;algorithme BWI (Boosted Wrapper Induction) pour r\u00e9aliser de l&rsquo;extraction d\u2019information \u00e0 partir de pages Web collect\u00e9es (Syst\u00e8me Agathe 2 &#8211; 2012). 
Puis nous nous sommes int\u00e9ress\u00e9s \u00e0 l&rsquo;usage d&rsquo;ontologies dans &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/pageperso.lis-lab.fr\/bernard.espinasse\/index.php\/fouille-de-textes-text-mining\/\" class=\"more-link\">Continuer la lecture <span class=\"screen-reader-text\"> \u00ab\u00a0Fouille de textes \/ Text Mining\u00a0\u00bb<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_crdt_document":"","footnotes":""},"class_list":["post-505","page","type-page","status-publish","hentry","entry"],"_links":{"self":[{"href":"https:\/\/pageperso.lis-lab.fr\/bernard.espinasse\/index.php\/wp-json\/wp\/v2\/pages\/505","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pageperso.lis-lab.fr\/bernard.espinasse\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/pageperso.lis-lab.fr\/bernard.espinasse\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/pageperso.lis-lab.fr\/bernard.espinasse\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/pageperso.lis-lab.fr\/bernard.espinasse\/index.php\/wp-json\/wp\/v2\/comments?post=505"}],"version-history":[{"count":5,"href":"https:\/\/pageperso.lis-lab.fr\/bernard.espinasse\/index.php\/wp-json\/wp\/v2\/pages\/505\/revisions"}],"predecessor-version":[{"id":1214,"href":"https:\/\/pageperso.lis-lab.fr\/bernard.espinasse\/index.php\/wp-json\/wp\/v2\/pages\/505\/revisions\/1214"}],"wp:attachment":[{"href":"https:\/\/pageperso.lis-lab.fr\/bernard.espinasse\/index.php\/wp-json\/wp\/v2\/media?parent=505"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}