Schedule

| CET | Sunday 31/03 | Monday 1/04 | Tuesday 2/04 | Wednesday 3/04 | Thursday 4/04 | Friday 5/04 |
|---|---|---|---|---|---|---|
| 8-9 | | Breakfast | Breakfast | Breakfast | Breakfast | Breakfast |
| 9-10 | | Welcome session | Thomas Wolf | Claire Gardent | Nature activities | Maxime Peyrard |
| 10-11 | | Poster Session 1 | | | | |
| 11-12 | | | | | | |
| 12-13 | | Lunch | Lunch | Lunch | Lunch | Lunch |
| 13-14 | | Sara Hooker | Nature activities | Poster Session 2 | Lab Session (LLM) | Shuttle back to Grenoble |
| 14-15 | | | | | | |
| 15-16 | | | | | | |
| 16-17 | | Lab Session (MT) | Barbara Plank | André Martins | | Arrival in Grenoble at 4pm |
| 17-18 | Shuttle departs from Grenoble at 5pm | | | | | |
| 18-19 | | | | | | |
| 19-20 | Dinner | Dinner | Dinner | Dinner | Dinner | |
| 20-21 | | | | | Poster Session 3 | |
| 21-22 | | | | | | |

Poster Sessions

Session 1

  • A 1 Marco Bronzini
    Glitter or Gold? Deriving Structured Insights from Sustainability Reports via Large Language Models
    [Abstract] Over the last decade, several regulatory bodies have started requiring the disclosure of non-financial information from publicly listed companies, in light of investors’ increasing attention to Environmental, Social, and Governance (ESG) issues. Publicly released information on sustainability practices is often disclosed in diverse, unstructured, and multi-modal documentation. This poses a challenge in efficiently gathering and aligning the data into a unified framework to derive insights related to Corporate Social Responsibility (CSR). Thus, using Information Extraction (IE) methods becomes an intuitive choice for delivering insightful and actionable data to stakeholders. In this study, we employ Large Language Models (LLMs), In-Context Learning, and the Retrieval-Augmented Generation (RAG) paradigm to extract structured insights related to ESG aspects from companies’ sustainability reports. We then leverage graph-based representations to conduct statistical analyses concerning the extracted insights. These analyses revealed that ESG criteria cover a wide range of topics, exceeding 500, often beyond those considered in existing categorizations, and are addressed by companies through a variety of initiatives. Moreover, disclosure similarities emerged among companies from the same region or sector, validating ongoing hypotheses in the ESG literature. Lastly, by incorporating additional company attributes into our analyses, we investigated which factors have the greatest impact on companies’ ESG ratings, showing that ESG disclosure affects the obtained ratings more than other financial or company data.
  • A 2 Hichem Ammar Khodja
    WikiFactDiff: A Large, Realistic, and Temporally Adaptable Dataset for Atomic Factual Knowledge Update in Causal Language Models
    [Abstract] The factuality of large language models (LLMs) tends to decay over time since events posterior to their training are “unknown” to them. One way to keep models up-to-date could be factual update: the task of inserting, replacing, or removing certain simple (atomic) facts within the model. To study this task, we present WikiFactDiff, a dataset that represents changes between two dates as a collection of simple facts divided into three categories: new, obsolete, and unchanged. We describe several update scenarios arising from various combinations of these three types of basic updates. The facts are represented by subject-relation-object triples; indeed, WikiFactDiff was constructed by comparing the state of the Wikidata knowledge base on 4 January 2021 and 27 February 2023. These facts are accompanied by verbalization templates and cloze tests that enable running update algorithms and computing their evaluation metrics. Contrary to other datasets, such as zsRE and CounterFact, WikiFactDiff constitutes a realistic update setting that involves various update scenarios, including replacements, archival, and new entity insertions. We also present an evaluation of existing update algorithms on WikiFactDiff. (A toy sketch of this triple-plus-cloze representation appears after this session’s list.)
  • A 3 Felix Herron
    Measuring bias in ASR
    [Abstract] A survey of studies on bias in modern speech processing and of methods for reducing it.
  • A 4 Fanny Ducel
    Bias Research for Language Models is Biased: a survey for deconstructing bias in LLMs
    [Abstract] Fairness and independence from bias are emerging as major quality criteria for Natural Language Processing applications. It is therefore crucial to provide a better understanding and control of these biases. This survey paper presents a review of recent research addressing the study of bias in language models. We use queries to scientific article search engines (mainly the ACL Anthology) and snowballing to identify a wide range of articles. Our analysis reveals that bias research mainly addresses methods for defining, measuring and mitigating bias. We highlight biases inherent to research on stereotypical biases in language models and conclude by calling for greater linguistic, cultural and typological diversity, and for greater transparency regarding these potentially biasing elements. The presented poster will focus on the section about the biases inherent to bias research and present some statistics related to the sociological position of authors who work on bias in LLMs. The figures and related study come from an article (in French) that is currently under review.
  • A 5 Ekaterina Uetova
    Exploring the Role of Conversational Agents in Enhancing Social Support within Online Peer Support Groups
    [Abstract] Online peer support groups (OPSGs) have proven effective in health interventions, supporting people during difficult phases of their lives. However, challenges arise in sustaining engagement. Our research highlights the critical role of support in online health interventions, emphasizing the need to address issues such as low response rates and attrition of peer supporters. This study employed a mixed-methods research approach, analyzing post-study interviews and chat messages from two cohorts of the MOV’D randomized controlled trial aimed at reducing sedentary behavior. Analysis of chat messages, based on the coding system we developed, revealed participants’ preference for interacting with social, emotional, and personal messages over informational ones; the significance of receiving responses for participants’ self-esteem and group engagement; and the impact of social presence cues on fostering a supportive atmosphere in OPSGs. Drawing from these findings, we proposed a framework for a conversational agent designed to promote balanced participation with minimal disruptions by providing supportive responses, prompting interaction, and delivering relevant information when needed.
  • A 6 Diandra Fabre
    Sign Language Translation challenges in the Large Language Models era
    [Abstract] Sign Languages (SL) are visual languages used among Deaf communities. While spoken languages are sequential (one word after another), sign languages use a 3D space, and two pieces of information can be signed simultaneously. Moreover, datasets are limited and complex to acquire, so Sign Languages are considered low-resource languages. Transformer architectures inspired by Vaswani et al. (2017) have improved sequence-to-sequence translation tasks. Sign Language translation methods use a video sequence as input and a text sequence as output. This task could also benefit from a transformer architecture, as presented in Camgöz et al. (2020), Sincan et al. (2023), and Tarrès et al. (2023). We focus this work on French Sign Language (LSF). Due to the lack of data, SL translation models are not able to comprehend and represent the French language: grammar and syntactic rules are not understood by the model. On the other hand, Large Language Models (LLMs) are able to comprehend and generate the French language with precision. In this work, we first present different baselines for Sign Language Translation. We then present several approaches envisaged to guide Sign Language Translation models using LLMs.
  • A 7 Christopher Klamm
    Populist Rhetoric
    [Abstract] Populist rhetoric is increasingly prevalent worldwide and affects all countries (Mudde/ Kaltwasser 2017). Although populism has been identified as a heterogeneous political phenomenon (ibid.) and a multi-dimensional construct (Wuttke et al. 2020), its detection in text often remains abstract (Gründl 2020, Bonikowski et al. 2022, Dai/ Kustov 2022, Jankowski/ Huber 2023), limiting a detailed view of populist rhetorical patterns. Rooduijn and Pauwels (2011) highlight people-centrism and anti-elitism (a negative stance towards the elite) as core features of populism, where both dimensions are essential (Mudde 2004, Rooduijn 2019, Dai/ Kustov 2022). To better analyze these dimensions, especially anti-elitism, we extend the idea of Opinion Role Labeling (ORL) (Wiegand/ Ruppenhofer 2015; Marasovic/ Frank 2018; Zhang et al. 2019) to analyze negative and positive stances towards the elite and the people.
  • A 8 Jing Liu
    Do language models need exponentially more data than infants: Probing word-form acquisition
    [Abstract] TBA
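
To make the atomic-fact representation described for poster A 2 concrete, here is a minimal sketch of how a subject-relation-object triple with a verbalization template and a cloze test might be stored and rendered. This is an illustration only, not the WikiFactDiff release format; all field names and the example fact are chosen for the sketch.

```python
from dataclasses import dataclass

@dataclass
class AtomicFact:
    """One subject-relation-object triple plus the templates used to probe a model."""
    subject: str
    relation: str
    obj: str
    status: str    # "new", "obsolete", or "unchanged" between the two snapshot dates
    template: str  # verbalization with placeholders for subject and object

    def verbalize(self) -> str:
        # Full statement of the fact, e.g. to check what an updated model generates.
        return self.template.format(subject=self.subject, obj=self.obj)

    def cloze(self) -> str:
        # Cloze test: the object is masked so an update algorithm can be scored on
        # whether the model completes the sentence with the expected object.
        return self.template.format(subject=self.subject, obj="____")

# Illustrative fact that changed between the two snapshot dates.
fact = AtomicFact(
    subject="France",
    relation="head of government",
    obj="Elisabeth Borne",
    status="new",
    template="The head of government of {subject} is {obj}.",
)
print(fact.verbalize())  # The head of government of France is Elisabeth Borne.
print(fact.cloze())      # The head of government of France is ____.
```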

Session 2

  • B 1 Thibaud Leteno
    Fair text classification with Wasserstein independence
    [Abstract] Group fairness is a central research topic in text classification, where reaching fair treatment between sensitive groups (e.g. women vs. men) remains an open challenge. We present a novel method for mitigating biases in neural text classification, agnostic to the model architecture. Considering the difficulty of distinguishing fair from unfair information in a text encoder, we take inspiration from adversarial training to induce Wasserstein independence between representations learned to predict our target label and the ones learned to predict some sensitive attribute. Our approach provides two significant advantages. Firstly, it does not require annotations of sensitive attributes in both testing and training data. This is more suitable for real-life scenarios compared to existing methods that require annotations of sensitive attributes at train time. Secondly, our approach exhibits a comparable or better fairness-accuracy trade-off compared to existing methods.
  • B 2 Matthieu François
    Quantitative analysis of opposition to transition technologies on social networks
    [Abstract] This thesis aims to adopt data mining techniques to study commitments to ecological transition, and in particular to characterize the rise of debates on transition technologies on social networks. Social networks are an extremely interesting source of information, as the preferred medium of communication and information for protesters. First, we propose to evaluate the potential of the data available on several platforms in terms of quantity, quality, and the cost required for this collection. Secondly, the description of debates in the collected data will be formalized in the form of patterns and markers that can be automatically detected by algorithms. Finally, the third objective will be to analyze the evolution of these markers over time to isolate characteristics of the trajectories leading debates on social networks towards active engagement in the field. Surveys and interviews will be used to assess the relevance of the results obtained using data mining tools.
  • B 4 Manuel Faysse
    CroissantLLM: A Truly Bilingual French-English Language Model
    [Abstract] We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks, covering various orthogonal aspects of model performance in the French language. Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models and strong translation models. We evaluate our model through the FMTI framework, and validate 81% of the transparency criteria, far beyond the scores of even most open initiatives. This work enriches the NLP landscape, breaking away from previous English-centric work in order to strengthen our understanding of multilinguality in language models. (A minimal usage sketch appears after this session’s list.)
  • B 5 Louis Jourdain
    Automatic morphological analysis for low-resource languages: the case of Ancient Greek compound nouns
    [Abstract] Ancient Greek compound nouns present some very interesting morphological properties (class and suffix transfer during compounding). However, while the methods of historical linguistics have managed to accurately describe these phenomena, they have failed to quantify them or give a general overview of the system due to the lack of data. This example will serve as a use case / brainstorming on how to deal with low-resource languages. Several different resources (annotated but small databases, dictionaries…) will be used and combined to try to automate compound analysis efficiently. The objective of this study is to determine how it could be possible to answer some linguistic questions in a bottom-up fashion, looking only at the data and evaluating the noise created at each processing step.
  • B 6 Alicia Breidenstein
    Using Locally Learnt Word Representations for better Textual Anomaly Detection
    [Abstract] The literature on general-purpose textual Anomaly Detection is quite sparse, as most textual anomaly detection methods are implemented as out-of-domain detection in the context of pre-established classification tasks. Notably, in a field where pre-trained representations and models are in common use, the impact of the pre-training data on a task that lacks supervision has not been studied. We use the simple setting of k-classes-out anomaly detection and search for the best pairing of representation and classifier. We show that well-chosen embeddings allow a simple anomaly detection baseline such as OC-SVM to achieve similar results and even outperform deep state-of-the-art models. (A toy OC-SVM sketch appears after this session’s list.)
  • B 7 Théo Gigant
    Representations of Encoded Features for Extractive and Abstractive Summarization of Multimodal Content
    [Abstract] TBA
  • B 8 Nathan Godey
    Headless Language Models: Learning without Predicting with Contrastive Weight Tying
    [Abstract] TBA
  • B 9 Kushal Tatariya
    Low-resource Multilingual NLP
    [Abstract] TBA
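
As a usage note for poster B 4: since CroissantLLM is released openly, a quick way to try it is a standard Hugging Face transformers generation loop, sketched below. The model identifier and generation settings are assumptions to be checked against the official model card.

```python
# Minimal generation sketch with Hugging Face transformers. The Hub ID below is an
# assumption; check the official CroissantLLM model card for the exact identifier.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "croissantllm/CroissantLLMBase"  # assumed ID of the 1.3B base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The model is bilingual, so both French and English prompts are reasonable inputs.
prompt = "La capitale de la France est"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```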
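
The baseline setting described for poster B 6 (pair a text representation with a simple one-class classifier) can be prototyped in a few lines of scikit-learn. The sketch below uses TF-IDF purely as a stand-in for the locally learnt representations studied in the poster; the texts are toy data, and the point is the OC-SVM baseline, not the specific embedding.

```python
# Toy k-classes-out anomaly detection baseline: fit a one-class SVM on "inlier" texts
# only, then score held-out texts. TF-IDF is only a placeholder representation here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

inlier_texts = [
    "the match ended with a late goal",
    "the striker scored twice in the final",
    "the team won the championship on penalties",
]
test_texts = [
    "the goalkeeper saved a penalty in extra time",  # same domain as the inliers (sports)
    "the central bank raised interest rates again",   # different domain (finance)
]

vectorizer = TfidfVectorizer().fit(inlier_texts)
X_train = vectorizer.transform(inlier_texts)
X_test = vectorizer.transform(test_texts)

# nu bounds the expected fraction of outliers among the training inliers.
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_train)
print(detector.predict(X_test))            # +1 = predicted inlier, -1 = predicted anomaly
print(detector.decision_function(X_test))  # higher score = more "normal"
```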

Session 3

  • C 1 Yanzhu Guo
    The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text
    [Abstract] This study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly prevalent practice given the prominence of powerful generative models. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we adapt and develop a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive finetuning experiments across various natural language generation tasks in English. Our findings reveal a consistent decrease in the diversity of the model outputs through successive iterations, especially pronounced for tasks demanding high levels of creativity. This trend underscores the potential risks of training language models on synthetic text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of language models. (A sketch of a simple lexical diversity measure appears after this session’s list.)
  • C 2 Ryan Whetten
    Open Implementation and Study of BEST-RQ for Speech Processing
    [Abstract] Self-Supervised Learning (SSL) has proven to be useful in various speech tasks. However, these methods are generally very demanding in terms of data, memory, and computational resources. BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) is an SSL method that has shown great performance on Automatic Speech Recognition (ASR) while being simpler than other SSL methods, such as wav2vec 2.0. Despite BEST-RQ’s great performance, details are lacking in the original paper, such as the amount of GPU/TPU hours used in pre-training, and there is no official easy-to-use open-source implementation. Furthermore, BEST-RQ has not been evaluated on downstream tasks other than ASR and speech translation. In this work, we describe a re-implementation of a random-projection quantizer and perform a preliminary study comparing it to wav2vec 2.0 on four downstream tasks. We discuss the details and differences of our implementation. We show that a random-projection quantizer can achieve downstream performance similar to wav2vec 2.0 while decreasing training time by more than a factor of two. (A sketch of the random-projection quantizer appears after this session’s list.)
  • C 4 Konstantin Dobler
    FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models
    [Abstract] Using model weights pretrained on a high-resource language as a warm start can reduce the need for data and compute to obtain high-quality language models for other, especially low-resource, languages. However, if we want to use a new tokenizer specialized for the target language, we cannot transfer the source model’s embedding matrix. In this paper, we propose FOCUS – Fast Overlapping Token Combinations Using Sparsemax, a novel embedding initialization method that effectively initializes the embedding matrix for a new tokenizer based on information in the source model’s embedding matrix. FOCUS represents newly added tokens as combinations of tokens in the overlap of the source and target vocabularies. The overlapping tokens are selected based on semantic similarity in an auxiliary static token embedding space. We focus our study on using the multilingual XLM-R as a source model and empirically show that FOCUS outperforms random initialization and previous work on language modeling and on a range of downstream tasks (NLI, QA, and NER). We publish our model checkpoints and code on GitHub.
  • C 5 Irina Proskurina
    When Quantization Affects Confidence of Large Language Models?
    [Abstract] Recent studies introduced effective compression techniques for Large Language Models (LLMs) via post-training quantization or low-bit weight representation. Although quantized weights offer storage efficiency and allow for faster inference, existing works have indicated that quantization might compromise performance and exacerbate biases in LLMs. This study investigates the confidence and calibration of quantized models, considering factors such as language model type and scale as contributors to quantization loss. Firstly, we reveal that quantization leads to a decrease in confidence regarding true labels, with varying impacts observed among different language models. Secondly, we observe fluctuations in the impact on confidence across different scales. Finally, we propose an explanation for quantization loss based on confidence levels, indicating that quantization disproportionately affects samples where the full model exhibited low confidence levels in the first place.
  • C 6 Giovanni Pinna
    Investigating the Understanding Limits of Large Language Models on Tabular Data: A Study on Explainability and Trustworthiness (recently started work)
    [Abstract] Large language models (LLMs) have revolutionized several natural language processing tasks, but their understanding and handling of tabular data is still understudied. In this study, we delve into the understanding limitations of LLMs when they are given tabular data combined with questions as prompts. To ensure the clarity and interpretability of LLMs’ responses, we curated a special dataset tailored for explainability and for testing the limits of tabular data understanding. The dataset employed in this study is deliberately designed to encompass increasing numbers of columns and rows of tabular data, simulating real-world scenarios with varying data and question complexities. Our study encompasses an investigation into the optimal methods and formats for presenting tabular data as prompts to LLMs. In this way, we aim to identify the most conducive approach for facilitating LLMs’ understanding of tabular data, thereby informing best practices for tabular data presentation in LLM-based applications. We employ state-of-the-art explainability tools to scrutinize the extent to which LLMs comprehend the underlying structure and semantics of tabular data, and we aim to ascertain the reliability and trustworthiness of the LLMs’ responses.
  • C 7 Aleksei Dorkin
    Sõnajaht: Definition Embeddings and Semantic Search for Reverse Dictionary Creation
    [Abstract] We present an IR-based reverse dictionary system using modern pre-trained language models and approximate nearest neighbors search algorithms. The proposed approach is applied to an existing Estonian lexicon resource, Sõnaveeb (word web), with the purpose of enhancing and enriching it by introducing cross-lingual reverse dictionary functionality powered by semantic search. The performance of the system is evaluated using both an existing labeled English dataset of words and definitions that is extended multilingually to also contain Estonian and Russian translations, and a novel unlabeled evaluation approach that extracts the evaluation data from the lexicon resource itself using synonymy relations. Evaluation results indicate that the IR-based semantic search approach without any model training is feasible, producing a median rank of 1 monolingually and a median rank of 2 cross-lingually in the unlabeled evaluation setting, with models trained for cross-lingual retrieval and including Estonian in their training data showing superior performance on our particular task.
  • C 8 Rian Touchent
    CamemBERT-bio: Leveraging Continual Pre-training for Cost-Effective Models on French Biomedical Data
    [Abstract] Clinical data in hospitals are increasingly accessible for research through clinical data warehouses. However, these documents are unstructured, and it is therefore necessary to extract information from medical reports to conduct clinical studies. Transfer learning with BERT-like models such as CamemBERT has allowed major advances for French, especially for named entity recognition. However, these models are trained on plain language and are less effective on biomedical data. Addressing this gap, we introduce CamemBERT-bio, a dedicated French biomedical model derived from a new public French biomedical dataset. Through continual pre-training of the original CamemBERT, CamemBERT-bio achieves an improvement of 2.54 points of F1-score on average across various biomedical named entity recognition tasks, reinforcing the potential of continual pre-training as an equally proficient yet less computationally intensive alternative to training from scratch. Additionally, we highlight the importance of using a standard evaluation protocol that provides a clear view of the current state of the art for French biomedical models.
  • C 9 Muhammad Khalifa
    Discriminator-Guided Chain-of-Thought Reasoning
    [Abstract] TBA
  • C 10 Nakanyseth Vuth
    KG2SyntheticData
    [Abstract] A framework for generating annotated synthetic data for Information Extraction models, addressing (1) data shortage and (2) security/compliance risks.
  • C 11 Maxime Méloux
    Model Explainability through the lens of Causal Abstraction
    [Abstract] Commonly used model explainability methods are highly local and fail to consider the global function computed by the network. The framework of causal abstraction, based on the theory of causal inference, suggests that it is possible to obtain an understandable explanation of a model’s output for a given behavior by abstracting its computation graph into a structural causal model (SCM). However, this is a hard task, partly due to the lack of available gold data. This work aims to create a benchmark for causal abstraction, containing pairs of SCMs along with neural models that have been trained to implement said SCMs through the technique of interchange intervention training. It also addresses challenges in the identifiability and uniqueness of causal discovery in neural models.
  • C 12 Maksim Aparovich
    Advancing Cross-Lingual Question Answering with Adversarial Training
    [Abstract] The work addresses the challenge of cross-lingual question answering (QA) by proposing an adversarial training approach to align representations for closely related languages. Drawing inspiration from applications of domain adaptation techniques, we aim to unify representations within QA systems, with an initial hypothesis of improving QA performance on unseen languages. The work describes our method, its benefits and drawbacks, and compares it with a naive baseline.
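
For poster C 1, a minimal example of the kind of lexical diversity measurement such analyses build on is distinct-n (the ratio of unique n-grams to total n-grams). This is a generic illustration, not the poster’s actual metric suite, which also covers syntactic and semantic diversity.

```python
# Generic lexical diversity measure (distinct-n): unique n-grams / total n-grams over a
# set of generated texts. Not the poster's exact metrics, which are broader in scope.
from collections import Counter

def distinct_n(texts: list[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams over a corpus of generated texts."""
    ngrams = Counter()
    for text in texts:
        tokens = text.lower().split()
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

# Toy generations: the second set repeats itself, so its diversity score is lower.
round_1 = ["the cat sat on the mat", "a dog ran across the park"]
round_2 = ["the cat sat on the mat", "the cat sat on the rug"]
print(distinct_n(round_1))  # 1.0 (all bigrams unique)
print(distinct_n(round_2))  # 0.6 (many repeated bigrams)
```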
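
For poster C 2, the target-generation step of BEST-RQ can be sketched compactly: speech frames are passed through a frozen random projection and assigned to the nearest entry of a frozen random codebook, and the resulting indices serve as discrete targets for masked prediction. Dimensions, normalization, and the distance choice below are illustrative, not the exact configuration of the poster’s implementation.

```python
# Sketch of a random-projection quantizer in the spirit of BEST-RQ: a frozen random
# projection plus a frozen random codebook turn continuous speech frames into discrete
# target indices. All sizes and the cosine-distance choice are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
feat_dim, proj_dim, codebook_size = 80, 16, 8192  # e.g. log-mel frames -> 16-d codes

projection = rng.normal(size=(feat_dim, proj_dim))     # frozen, never trained
codebook = rng.normal(size=(codebook_size, proj_dim))  # frozen, never trained
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map (T, feat_dim) speech frames to (T,) discrete target indices."""
    z = frames @ projection
    z /= np.linalg.norm(z, axis=1, keepdims=True) + 1e-8
    # Nearest codebook entry by cosine similarity (L2 on unit vectors is equivalent).
    return np.argmax(z @ codebook.T, axis=1)

targets = quantize(rng.normal(size=(200, feat_dim)))  # a dummy utterance of 200 frames
print(targets.shape, targets[:10])
```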