Thesis of Salima Lamsiyah

Salima Lamsiyah, "Deep Learning-Based Unsupervised Extractive Methods for Multi-Document Summarization" (defended in 2021)

Manuscript

S. Lamsiyah's thesis addresses deep learning-based unsupervised extractive multi-document summarization. For Generic Multi-Document Summarization (G-MDS), it first proposes a centroid approach using different sentence embedding representations, and then exploits transfer learning from fine-tuning BERT (Bidirectional Encoder Representations from Transformers) on natural language understanding tasks for sentence representation learning. For Query-Focused Multi-Document Summarization (QF-MDS), it proposes an unsupervised extractive method based on transfer learning from pre-trained sentence embedding models, combined with the BM25 model and the maximal marginal relevance criterion.

Salima Lamsiyah's thesis (defended in 2021) focuses on extractive ATS systems, and more specifically on the Generic Multi-Document Summarization (G-MDS) and Query-Focused Multi-Document Summarization (QF-MDS) tasks. G-MDS systems generate summaries that represent all relevant facts of the source documents without considering the users' information needs. In contrast, QF-MDS systems produce summaries whose content is driven by the user's information need, expressed as a query. Our main objective is to develop robust and effective systems for both G-MDS and QF-MDS tasks that require no domain knowledge. We therefore propose four contributions to address this issue and improve the performance of unsupervised extractive multi-document summarization.

In the first contribution, we propose an unsupervised extractive method for G-MDS based on the centroid approach and sentence embedding representations. We improve sentence scoring by combining three metrics: sentence content relevance, sentence novelty, and sentence position. Moreover, we provide a comparative analysis of nine sentence embedding models used to represent sentences as dense vectors in a low-dimensional vector space in the context of extractive multi-document summarization.

In the second contribution, we improve the aforementioned G-MDS method by leveraging transfer learning from BERT fine-tuned on Natural Language Understanding tasks for sentence representation learning. Specifically, we fine-tune BERT on supervised intermediate tasks from the GLUE benchmark using single-task and multi-task fine-tuning, and then transfer the learned knowledge to our summarization task.

In the third contribution, we propose an unsupervised extractive QF-MDS method based on transfer learning from pre-trained sentence embedding models, the BM25 model, and the maximal marginal relevance criterion. We combine the BM25 model with semantic similarity to select a subset of sentences based on their relevance to the query. Moreover, we incorporate sentence embedding representations into the maximal marginal relevance method to re-rank the candidate sentences, maintaining query relevance while minimizing redundancy.

In the last contribution, we explore the potential of the pre-trained Sentence-BERT (SBERT) model, based on a Siamese network structure and a fine-tuning mechanism, to boost the performance of the extractive QF-MDS task.
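To make the centroid idea in the first contribution concrete, the sketch below scores sentences by combining content relevance (similarity to the document centroid), novelty (dissimilarity to sentences already selected), and position. It assumes precomputed sentence embeddings; the combination weights w_rel/w_nov/w_pos and the greedy selection loop are illustrative choices, not the exact formulation used in the thesis.

```python
import numpy as np

def centroid_summarize(sent_embs: np.ndarray, limit: int,
                       w_rel: float = 0.6, w_nov: float = 0.2,
                       w_pos: float = 0.2) -> list[int]:
    """Greedy centroid-based extractive selection (illustrative weights)."""
    n = len(sent_embs)
    limit = min(limit, n)
    # Normalize embeddings so dot products are cosine similarities.
    embs = sent_embs / np.linalg.norm(sent_embs, axis=1, keepdims=True)
    centroid = embs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)

    relevance = embs @ centroid               # content relevance to the centroid
    position = 1.0 / (1.0 + np.arange(n))     # earlier sentences score higher

    selected: list[int] = []
    while len(selected) < limit:
        best, best_score = -1, -np.inf
        for i in range(n):
            if i in selected:
                continue
            # Novelty: penalize similarity to the closest sentence already chosen.
            novelty = 1.0 - max((float(embs[i] @ embs[j]) for j in selected),
                                default=0.0)
            score = w_rel * relevance[i] + w_nov * novelty + w_pos * position[i]
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return sorted(selected)  # restore original document order
```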
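The second contribution's transfer-learning step can be sketched with the Hugging Face transformers and datasets libraries: fine-tune BERT on one GLUE intermediate task (MNLI here, as a single-task example), then reuse the fine-tuned encoder to embed sentences. The checkpoint, task choice, and hyperparameters below are assumptions for illustration, not the thesis's setup.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Fine-tune BERT on an intermediate GLUE NLU task before reusing its encoder.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # MNLI: entailment/neutral/contradiction

dataset = load_dataset("glue", "mnli")

def tokenize(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-mnli", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation_matched"],
)
trainer.train()

# model.bert is now the fine-tuned encoder; mean-pooling its token outputs
# yields sentence embeddings usable in the centroid method sketched above.
```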
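For the third contribution, the sketch below combines BM25 lexical scores (via the rank_bm25 package, a stand-in for whichever BM25 implementation the thesis used) with embedding-based semantic similarity to pick query-relevant candidates, then re-ranks them with an embedding-based maximal marginal relevance (MMR) loop. The score normalization, candidate cut-off, and lambda trade-off are illustrative.

```python
import numpy as np
from rank_bm25 import BM25Okapi

def qf_mds(query: str, sentences: list[str], embs: np.ndarray,
           q_emb: np.ndarray, k_candidates: int = 20,
           summary_len: int = 5, lam: float = 0.7) -> list[int]:
    """Query-focused selection: BM25 + semantic similarity, then MMR re-ranking."""
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    q = q_emb / np.linalg.norm(q_emb)

    # Step 1: lexical relevance (BM25) combined with semantic similarity.
    bm25 = BM25Okapi([s.lower().split() for s in sentences])
    lexical = np.array(bm25.get_scores(query.lower().split()))
    lexical = lexical / (lexical.max() if lexical.max() > 0 else 1.0)
    semantic = embs @ q
    candidates = list(np.argsort(-(lexical + semantic))[:k_candidates])

    # Step 2: MMR -- balance query relevance against redundancy.
    selected: list[int] = []
    while candidates and len(selected) < summary_len:
        def mmr(i):
            redundancy = max((float(embs[i] @ embs[j]) for j in selected),
                             default=0.0)
            return lam * float(embs[i] @ q) - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)
```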
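Finally, the pre-trained SBERT encoder explored in the last contribution is available through the sentence-transformers library. The minimal usage example below, with an illustrative checkpoint and toy sentences, shows how query and sentence embeddings are obtained and compared; such embeddings can feed the BM25 + MMR pipeline above.

```python
from sentence_transformers import SentenceTransformer, util

# Pre-trained SBERT encoder (Siamese BERT fine-tuned for sentence similarity).
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint choice

query = "effects of climate change on coastal cities"
sentences = [
    "Rising sea levels threaten infrastructure in low-lying urban areas.",
    "The committee approved the annual budget on Tuesday.",
    "Storm surges have become more destructive along many coastlines.",
]

q_emb = model.encode(query, convert_to_tensor=True)
s_embs = model.encode(sentences, convert_to_tensor=True)

# Cosine similarities between the query and each candidate sentence.
print(util.cos_sim(q_emb, s_embs))
```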