Automated L2 Speaking Assessment (AL2SA) International Workshop 2026
9:00 - 18:00
Saint-Martin-d'Hères - Domaine universitaire
Salle Jacques Cartier at the Maison des Langues
1141 Avenue Centrale, 38400 Saint-Martin-d'Hères, France
Recent advances in artificial intelligence, particularly multimodal language models, are opening unprecedented opportunities for automated assessment of speaking skills in second language learning. Yet fundamental questions remain: What exactly should we assess in L2 oral production? How can technology best serve valid, reliable, and ethical assessment practices?
The AL2SA workshop (pron. /ˈæl.tsə/) brings together international experts to address these critical issues. Over two days, we will explore the theoretical foundations, methodological challenges, and practical applications of automated speaking assessment, balancing technological innovation with pedagogical and psychometric rigor.
Key Information
- Dates: March 5-6, 2026
- Format: Hybrid (on-site + Zoom)
- Venue: Salle Jacques Cartier, Maison des Langues et des Cultures, Université Grenoble Alpes
- Registration: Free and open to everyone – registration required
- Lunch registration deadline: February 20, 2026
Organized by: UGA's Language Skills Assessment Unit (Cellule d'évaluation des compétences en langues, Service des Langues), in partnership with the LIDILEM and LIG laboratories, UGA's Language Center, and the Multidisciplinary Institute in Artificial Intelligence (MIAI).


Practical Information
📍 Venue & Access
Salle Jacques Cartier
Maison des Langues et des Cultures
Université Grenoble Alpes
1141 Avenue Centrale, 38400 Saint-Martin-d'Hères, France
- From Grenoble train station: Take Tram B (25 min) to Bibliothèques Universitaires (Direction: Gières Plaine des Sports)
- From Gières Gare – Université station: Take Tram B (5 min) to Bibliothèques Universitaires (Direction: Oxford)
🌍 Remote Participation
Join via Zoom: https://univ-grenoble-alpes-fr.zoom.us/j/94720156701?pwd=Zyhi1WuR3UxAuaa2be5hPyq8bowJmT.1
Program
THURSDAY, MARCH 5TH
9:00-9:15 OPENING (Alice Henderson, Univ. Grenoble Alpes, France)
Session n°1: Assessing speaking skills
9:20-9:50 Linda Terrier & Lionel Fontan
Univ. Toulouse Jean-Jaurès, France
Archean Labs, France
linda.terrier[at]univ-tlse2.fr (slides available upon request; work in progress)
What Are We Actually Measuring? Reflections on the Construct of Intelligibility in Automated L2 Speech Assessment
The measurement of intelligibility has become a cornerstone of L2 speech assessment, whether human or automated. However, this construct, which at first glance seems obvious and stable given Munro and Derwing's seminal 1995 definition (“the extent to which a speaker's message is understood by a listener”), proves to be much more complex when we look at the concrete methods used to evaluate it. Like listening comprehension, intelligibility can only be measured indirectly, which systematically raises two fundamental questions: what exactly is being measured through the proposed elicitation task, and through the chosen evaluation method?
This impossibility of direct access to a listener’s understanding—and thus to the speaker’s intelligibility—renders any measurement of intelligibility fundamentally problematic and complex. Our recent scoping review (Terrier et al., under review) has further revealed a wide diversity of elicitation tasks and methods used to assess intelligibility. Sound identification, orthographic transcription, keyword spotting, subjective ratings, comprehension questions… each modality engages the listener at different levels of processing, resulting in the measurement of partially distinct constructs.
In the first part of this presentation, Linda Terrier will situate the construct of intelligibility within the broader framework of listening comprehension, by analyzing several assessment modalities from the perspective of the Kintsch & van Dijk model of comprehension (1998), which distinguishes between lower and higher levels of comprehension through the construction of the microstructure, macrostructure, and situation model of the message at hand.
To concretely illustrate these issues, Lionel Fontan will then present an example of a recently developed task: oral translation of short sentences. This type of task provides a semantic reference for the message the learner wishes to convey, while allowing flexibility in the linguistic form. Lionel Fontan has used this task to investigate the external validity of subjective intelligibility ratings, and to analyze the bias introduced by the absence of a reference for listeners.
Ultimately, because of the inherent complexity of intelligibility in L2 speech, we argue that any approach to its assessment—especially automated assessment—must begin by explicitly stating what aspect of intelligibility is being measured and through which task.
9:50-10:20 Nivja de Jong
Leiden Univ., the Netherlands
View presentation slides
What is speaking proficiency and how to develop high-quality, practical, and ethical automated assessments for its measurement?
In current classrooms, among the second language (L2) skills, practicing and assessing speaking are often neglected. Its loud and transient nature makes it hard for teachers to provide individualized feedback, and assessing speech recordings is highly time-consuming. Automated speaking assessment can help address these issues. In this presentation (based on De Jong et al., 2025), I first define speaking as a skill and outline the requirements for high-quality, practical, and ethical tools for automated scoring and feedback. Then, drawing on the AI-based assessment framework (Fang et al., 2023) and an educational design perspective, I propose recommendations on how computational linguists, educators, and assessment practitioners can join forces to develop automated systems that are technically sound, ethically responsible, and likely to be adopted in educational practice.
References
De Jong, N. H., Raaijmakers, S., & Tigelaar, D. (2025). Developing high-quality, practical, and ethical automated L2 speaking assessments. System, 134, 103796. https://doi.org/10.1016/j.system.2025.103796
Fang, Y., Roscoe, R. D., & McNamara, D. S. (2023). Artificial intelligence-based assessment in education. In B. Du Boulay, A. Mitrovic, & K. Yacef (Eds.), Handbook of Artificial Intelligence in Education (pp. 485–504). Edward Elgar Publishing. https://doi.org/10.4337/9781800375413.00033
10:20-11:00 Discussion
11:00-11:30 ☕ BREAK ☕
Session n°2: Listening Disfluency
11:30-12:10 Nobuaki Minematsu
Univ. of Tokyo, Japan
View presentation slides
Measuring, Analyzing, and Predicting Listening Disfluency of Learners and Raters: Using Speech and AI Technologies for Automated Assessment
Every learner aims to become easy to understand in L2 speech communication, yet their unique pronunciation may sometimes hinder this goal. Listening is a mental process that is difficult to observe directly, which is one reason learners often feel anxious about how smoothly they are understood. How can we measure listening disfluency? Do we need expensive brain-sensing techniques to quantify it?
In this talk, we present a pedagogically valid and practical method for measuring listening disfluency. Shadowing is an immediate reproduction of presented speech with a short delay, in which listeners repeat what they hear in their own accent. When listeners experience perceptual difficulty, their shadowed reproduction breaks down, revealing points at which cognitive processing load increases. By analyzing these disruptions, we can capture dynamic properties of listening disfluency.
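The talk does not spell out the signal processing behind this breakdown detection, but the general idea can be sketched: align the shadowed recording to the model recording and look for regions where the alignment cost spikes. Below is a minimal illustration assuming MFCC features and dynamic time warping (DTW); the feature set, distance metric, and threshold are placeholder choices, not the authors' actual method.

```python
import librosa
import numpy as np
from scipy.spatial.distance import cdist

def disruption_candidates(model_wav, shadow_wav, sr=16000, top_percent=10):
    """Flag time points where a shadowed recording diverges most from the
    model speech, as rough candidates for listening-disfluency points."""
    y_model, _ = librosa.load(model_wav, sr=sr)
    y_shadow, _ = librosa.load(shadow_wav, sr=sr)
    # Frame-level spectral features for both recordings
    X = librosa.feature.mfcc(y=y_model, sr=sr, n_mfcc=13)
    Y = librosa.feature.mfcc(y=y_shadow, sr=sr, n_mfcc=13)
    # DTW alignment between the two feature sequences
    _, wp = librosa.sequence.dtw(X=X, Y=Y, metric="cosine")
    # Local (per-step) cost along the warping path
    C = cdist(X.T, Y.T, metric="cosine")
    path_cost = C[wp[:, 0], wp[:, 1]]
    # Keep the worst-aligned steps and convert to seconds on the model timeline
    threshold = np.percentile(path_cost, 100 - top_percent)
    frames = wp[path_cost >= threshold, 0]
    return np.unique(librosa.frames_to_time(frames, sr=sr).round(2))
```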
We then demonstrate two applications of this measured disfluency. The first is visualizing global communicability, which represents how easily individual learners from around the world understand others and how easily they are understood in return. The second is the development of a virtual shadowing rater, built by collecting a human rater’s shadowing data for L2 English and using it to model intelligibility-based L2 speech assessment.
Keywords:
listening disfluency, shadowing-based assessment, L2 intelligibility, global communicability
12:10-12:30 Noriko Nakanishi (online)
Kobe Gakuin Univ., Japan
Slides for participants only
The Shadowing Exchange Community: Enhancing Accent Perception, Intelligible Speech, and Empathetic Feedback
While AI-based tools provide useful automated assessments of L2 fluency, they cannot fully replicate the socio-emotional dynamics of actual human communication. This presentation introduces the Shadowing Exchange Community, a peer-to-peer program that utilizes AI technology not as a final goal, but as a scaffolding device to enhance human-to-human interaction.
In this program, participants record 30-second speeches in both English and Japanese. During this process, they are presented with immediate Automatic Speech Recognition (ASR) results, allowing them to verify intelligibility—at least to the system—and re-record as needed before engaging with other learners. This ensures that AI serves to prepare participants for the community's core activity: a reciprocal exchange where learners from diverse backgrounds shadow each other and provide mutual feedback.
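The abstract does not name the ASR engine used for this self-check; as a rough sketch of the loop (record, see what the machine heard, re-record if needed), here is how it might look with the open-source openai-whisper package as a stand-in:

```python
import whisper

# Placeholder model size; the program's actual ASR system is not specified.
model = whisper.load_model("small")

def asr_self_check(recording_path):
    """Show the learner what the ASR system 'heard' in their 30-second
    speech, so they can judge machine intelligibility and re-record."""
    result = model.transcribe(recording_path)
    print("ASR heard:", result["text"].strip())
    return result["text"]
```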
Crucially, the program is designed to enhance cross-cultural sensitivity and communication skills, with a specific focus on how to provide supportive feedback. This approach is structured around three key goals:
1. Practicing listening to various accents.
2. Checking one's own intelligibility with speakers of other L1s.
3. Learning to give constructive, respectful, and supportive feedback.
This cyclical, community-based model offers valuable insights into the social, emotional, and intercultural dimensions of language learning. As of November 2025, the program has engaged approximately 180 participants, including English L1 speakers (~70), Japanese L1 speakers (~60), and others (~50).
This presentation will share the program's structure and discuss preliminary findings on its impact on learners' metalinguistic awareness and affective filter.
Keywords: Shadowing, Peer Feedback, Cross-Cultural Communication, AI Scaffolding, Socio-Emotional Learning
12:30-13:00 Discussion
13:00-14:30 🍴 LUNCH 🍴 (Provided, please register)
Session n°3: Teachers' Open Session
14:30-15:30 Beata Walesiak
unpolish.pl, Poland
Slides for participants only
Apps for L2 pronunciation training
In this talk, educators will learn about the pedagogical use of commercially-available pronunciation and speech coaching apps, with a focus on the features and functionalities they include and the way they can be integrated into teaching and learning practices (in classroom instruction and in learners’ self-study). At the same time, the talk will critically examine common promises made by app developers, such as efficiency or judgement-free feedback, and discuss app limitations as well as implications of positioning AI as an authoritative evaluator of learner performance. The aim of the session is to help teachers make informed, context-sensitive choices when using the apps in pedagogy.
Bio
Beata Walesiak is a lecturer for the Open University at the University of Warsaw (UOUW) and for Language Science and Technology (LST) at the Institute of Applied Linguistics, University of Warsaw, Poland. She is also a teacher trainer, linguist, and researcher with unpolish.pl. She has cooperated with a number of schools, academic institutions, and start-ups in the domain of educational technologies, mobile- and computer-assisted pronunciation training, and AI-based speech pedagogy and assessment. She is a dedicated IATEFL Pronunciation Special Interest Group (PronSIG) Committee member.
15:30-16:00 Sylvain Coulange, Pinxun Huang & Eli Stafford
Univ. Grenoble Alpes, France
Univ. Lorraine, France
Univ. Paris Cité, France
View presentation slides
Designing a Speaking Assessment Module for the SELF Language Placement Test
This talk presents the development of an automated speaking assessment module for SELF, the online language placement test at Université Grenoble Alpes (https://self.univ-grenoble-alpes.fr/english/). The project, known as SELF Production Orale (SELF PO), is a collaborative effort between the Laboratoire de Linguistique et Didactique des Langues Étrangères et Maternelles (LIDILEM) and the Laboratoire d'Informatique de Grenoble (LIG). Two speaking modules are currently under development: one for English and one for French, with the French module developed in partnership with CUEF and ADCUEFE.
This talk will focus on the English speaking module, covering the development phases, an overview of the speaking tasks and assessment criteria, and preliminary results. We will conclude with a discussion of the key challenges and limitations encountered in implementing automated speaking assessment at scale, offering insights for institutions pursuing similar initiatives.
16:00-16:30 Discussion
16:30-17:30 ☕ SOCIAL BREAK ☕
FRIDAY, MARCH 6TH
Session n°4: Intelligibility
9:00-9:30 Dan Frost
Univ. Grenoble Alpes, France
View presentation slides
Addressing language-specific needs: what makes learner speech intelligible and how can we assess it?
Over the last 25 years, the focus of pronunciation teaching has increasingly shifted away from “native speaker norms” and towards teaching for intelligibility, or as Levis (2005; 2020) puts it, from the “nativeness principle” to the “intelligibility principle”. While this is a noble aim for the majority of learning situations, what makes learner speech more or less intelligible is still very much up for debate. Much of my work over the past ten years has been an attempt to better understand the nature of intelligibility and its relationship to comprehension, particularly in the context of French learners of English. To this end, we developed a set of descriptors (Frost & O’Donnell, 2018). The descriptors were initially created following the longitudinal ELLO project (Frost & O’Donnell, 2015), where we identified that the original CEFR phonological control descriptors (Council of Europe, 2001) lacked the necessary precision to address the language-specific needs of our learners. While the Companion Volume to the CEFR (Council of Europe, 2020) has gone further in recognizing the importance of prosodic features, its common, universal nature still fails to address the language-specific needs of learners.
The Prosody Descriptors we developed are an attempt to address language specificity, and serve a dual function. First, they enable an accurate assessment of English pronunciation aspects that are particularly problematic for French speakers, focusing on features that significantly impede intelligibility. Second, they function as a practical pedagogical tool, allowing both learners and teachers to establish clear, actionable goals for pronunciation instruction. Although calibrated for French speakers, the features targeted by these descriptors are valid across all learners of English. The tool has since been deployed and validated in several subsequent studies (Frost, 2021; 2022; forthcoming; Vézien & Frost, forthcoming), confirming its utility and accuracy.
This presentation will explore the nature of intelligibility, the relationship between perception and production, particularly relating to pronunciation, and specifically in relation to the pronunciation of English by French learners. I will outline how these questions informed the development of the descriptors, and how they continue to inform my work as I try to better understand how to help my learners understand English and make themselves understood in a variety of both national and international contexts in English.
References
Council of Europe. (2001). Common European Framework of Reference for Languages: Learning, teaching, assessment. Cambridge University Press.
Council of Europe. (2020). Common European Framework of Reference for Languages: Learning, teaching, assessment. Companion Volume. Council of Europe Publishing.
Frost, D. (forthcoming). Pronunciation assessment: Deconstructing intelligibility and setting learning objectives. La clé des langues.
Frost, D. (2022). Doing pronunciation online: An embodied and cognitive approach, which puts prosody first. RANAM (Recherches Anglaises et Nord-AMéricaines), 55/2022, 11–28.
Frost, D. (2021). Prosodie, intelligibilité et compréhensibilité : l’évaluation de la prononciation lors d’un stage court. Les Langues Modernes, 3(2020), 76–90.
Frost, D., & O’Donnell, J. (2018). Evaluating the essentials: The place of prosody in oral production. In J. Volín (Ed.), The Pronunciation of English by Speakers of Other Languages (pp. 228–259). Cambridge Scholars Publishing. ISBN: 1-5275-0390-9
Frost, D., & O’Donnell, J. (2015). Success: B2 or not B2, that is the question (the ELLO project – Etude Longitudinale sur la Langue Orale). Recherche et pratiques pédagogiques en langues de spécialité – Cahiers de l’APLIUT, 34(2). https://doi.org/10.4000/apliut.5195
Levis, J. M. (2005). Changing Contexts and Shifting Paradigms in Pronunciation Teaching. TESOL Quarterly, 39(3), 369–377.
Levis, J. (2020). Revisiting the Intelligibility and Nativeness Principles. Journal of Second Language Pronunciation, 6(3), 310–328. https://doi.org/10.1075/jslp.20050.lev
Vézien, S., & Frost, D. (2026, in preparation). Talking Heads: Improving pronunciation with text-to-speech software. La clé des langues.
9:30-10:00 Kevin Hirschi
Univ. of Texas San Antonio, USA
Slides for participants only
Towards automated measurement and feedback of L2 intelligibility: Challenges and a pedagogically informed roadmap
Second language (L2) intelligibility represents a precursor for communication in which sounds, words, or phrases are understood by a listener. Therefore, a comprehensive understanding of what constitutes intelligibility in speech—and what causes loss of intelligibility—can provide insights into the development of L2 proficiency, inform L2 learning curricula, and create parameters for effective assessment and feedback. Focusing on L2 English in the North American academic context, this presentation begins with a review of research on linguistic features associated with intelligibility (e.g., Kang et al., 2018, 2020), as well as their complex, nonlinear predictive power across listener backgrounds (Hirschi et al., 2023, 2025; Shekar et al., 2023). I then review alignment of audio LLMs with listeners through the lens of intelligibility, analyzing divergence from listeners and relating these issues to model bias (Hirschi & Kang, 2024; Kang & Hirschi, 2025).
With an understanding of the challenges of aligning machine listening with human comprehension, I argue that the central goal of designing automated measurement and feedback solutions for L2 intelligibility starts from pedagogy informed by theory and research, rather than technological capacity. As such, I will focus the remainder of the presentation on the theoretical tenets and research-informed practices which can guide the design of automated measurement and feedback of L2 intelligibility for inclusive, effective, and sustainable L2 learning. Drawing from the social nature of language and sociocultural perspectives (Vygotsky, 1987), automated measurement and feedback can, in theory, provide learners with scaffolding and a stress-free simulation of interaction. Interactionist literature on feedback further informs the construction and delivery of automated feedback (Long, 1996), and outlines measurement that is most relevant for learning. Furthermore, learner agency and proactive behavior explain why and how some learners independently make more progress with automated learning tools, offering insights into differential interventions for important individual differences (Duff, 2012; Papi, 2025). I will conclude by presenting an early effort in implementing pedagogically informed automatic feedback (Hirschi et al., 2025) as a proof of concept and potential roadmap for L2 intelligibility measurement and development.
10:00-10:30 Joan Carles Mora
Univ. de Barcelona, Spain
View presentation slides
Measures of acoustic and perceptual contrastiveness and nativelikeness in assessing segmental pronunciation development
Assessing the development of L2 pronunciation at the segmental level is a methodological challenge, especially after short phonetic training interventions (e.g. four 30-minute high-variability phonetic training sessions focusing on one target contrast) or short L2 pronunciation pedagogical interventions (e.g. a few sessions of pronunciation-focused task-based teaching), where the size of improvement is expected to be small. Still, pronunciation assessment is crucial to be able to evaluate the effectiveness of different phonetic training techniques and pedagogical approaches to pronunciation instruction. For example, Saito & Plonsky (2019) meta-analysed 77 pronunciation teaching studies and found that those assessing pronunciation through acoustic measures focusing on specific speech dimensions (e.g. VOT, formant frequencies) in controlled elicitation tasks (e.g. read-aloud) found pronunciation instruction to be more effective than those assessing pronunciation through perceptual judgments focusing on global dimensions (e.g. comprehensibility) in spontaneous speech (e.g. a monologic oral narrative task). Since segmental pronunciation training and instruction typically focus on challenging L2 vowel and consonant contrasts (e.g. /r/-/l/ for Japanese learners of English; [ð]-[ɾ] for English learners of Spanish; /iː/-/ɪ/ for Spanish learners of English) that carry a high functional load and can have a detrimental impact on L2 speech intelligibility, it is important to address the following pronunciation assessment issues in relation to improvement in segmental contrasts resulting from phonetic training or pronunciation instruction:
(1) Should improvement be measured in terms of contrastiveness, nativelikeness, or both?
(2) Should contrastiveness / nativelikeness be measured acoustically or perceptually? If perceptually (by humans), how? Rating tasks, discrimination and identification tasks, intelligibility tasks?
(3) To what extent is improvement in segmental contrasts measurable in spontaneous speech? How can we evaluate the impact of segmental learning (and phonetic training and pronunciation instruction focusing on segmental contrasts) on L2 speech intelligibility?
The overall aim of this talk is to stimulate discussion about the implications of the answers to these questions for the automated assessment of L2 pronunciation (at the segmental level) and the evaluation of segmental pronunciation features having an impact on L2 speech intelligibility.
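As one concrete illustration of the acoustic option in question (2) above, contrastiveness for a vowel pair can be operationalized as the distance between the two vowels in F1/F2 space. The sketch below uses the parselmouth interface to Praat; the interval times are assumed to come from manual annotation, and normalization steps (e.g. Bark conversion, speaker normalization) are deliberately omitted. It is a minimal sketch of one possible measure, not the speaker's method.

```python
import numpy as np
import parselmouth  # Python interface to Praat

def mean_f1_f2(wav_path, start, end, n_points=5):
    """Mean F1/F2 (Hz) over an annotated vowel interval [start, end] in seconds."""
    sound = parselmouth.Sound(wav_path)
    formants = sound.to_formant_burg()
    times = np.linspace(start, end, n_points)
    f1 = np.nanmean([formants.get_value_at_time(1, t) for t in times])
    f2 = np.nanmean([formants.get_value_at_time(2, t) for t in times])
    return np.array([f1, f2])

def vowel_contrastiveness(wav_path, interval_a, interval_b):
    """Euclidean F1/F2 distance between two vowel tokens (e.g. a learner's
    /i:/ and /I/): one rough acoustic index of how distinct the contrast is."""
    return float(np.linalg.norm(mean_f1_f2(wav_path, *interval_a)
                                - mean_f1_f2(wav_path, *interval_b)))
```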
References
Saito, K., & Plonsky, L. (2019). Effects of second language pronunciation teaching revisited: A proposed measurement framework and meta‐analysis. Language Learning, 69(3), 652-708.
10:30-11:00 Discussion
11:00-11:30 ☕ BREAK ☕
Session n°5: Interactions
11:30-12:00 Serge Bibauw & Zhaori Wang
Univ. Catholique de Louvain, Belgium
KU Leuven, Belgium
View presentation slides
Conversational AI for spoken L2 development: meta-analysis of effectiveness studies and insights for assessment
Abstract to be added soon.
12:00-12:30 Tsuneo Kato
Doshisha Univ., Japan
Slides for participants only
Effect of Prompt Corrective Feedback and Analysis of Error Patterns in Learning Syntactic Form with Trialogue-based CALL System
A substantial amount of form-focused practice is necessary for second language (L2) learners who are transitioning from answering questions in a few words to answering in full sentences, and for those expanding their range of expressions. We are developing a computer-assisted language learning (CALL) system that focuses on learning a syntactic form through conversing with two computer characters. The trialogue-based CALL system promotes a learner’s implicit learning of the focused form by first demonstrating a model conversation between the characters, then asking the learner similar questions. With recent advancements in automatic speech recognition (ASR) and natural language processing (NLP), we added a simple prompt corrective feedback (CF) function for the learner’s answers using Whisper ASR and GPT-4o. We conducted a comparative experiment in which two groups of Japanese university students practiced the English inanimate subject construction with and without the CF, and took a pre-test, a post-test, and three retention tests over a period of up to 100 days. The results showed a significant effect of the CF in all the post- and retention tests. To improve the appropriateness of the CF, we further developed an automatic classifier that sorts learners’ errors into global errors, which hinder communication, and local errors, which do not. The accuracy of the classifier was measured against manual classification by native speakers of English, and improved with prompt engineering of the underlying large language model (LLM).
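The exact prompts and architecture are the authors'; purely as an illustration of the Whisper-plus-GPT-4o corrective feedback loop described above, a minimal sketch might look like this (model names, prompt wording, and feedback format are all assumptions):

```python
import whisper
from openai import OpenAI

asr = whisper.load_model("small")
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def corrective_feedback(answer_wav, question):
    """Transcribe a learner's spoken answer, then ask the LLM for a short
    corrective recast if the target form (here, the inanimate subject
    construction) is missing or malformed."""
    answer = asr.transcribe(answer_wav)["text"].strip()
    prompt = (
        f"A learner of English was asked: '{question}'\n"
        f"They answered: '{answer}'\n"
        "If the answer does not correctly use an inanimate-subject "
        "construction, reply with one short corrected model sentence; "
        "otherwise reply 'Well done.'"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return answer, response.choices[0].message.content
```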
12:30-13:00 Mayuko Aiba & Nobuaki Minematsu
Univ. of Tokyo, Japan
View presentation slides 1
View presentation slides 2
LLM-based interaction for academic language learning: three case studies
Large language models (LLMs) are increasingly used to support academic language learning, yet their educational value depends critically on how interaction is designed and situated. This presentation reports three case studies exploring LLM-based interaction for supporting students’ academic communication in higher education.
The first study investigates a GPT-based oral Q&A simulation system designed to help students prepare for their first international conference. By generating realistic questions based on students’ own papers and enabling spoken interaction, the system provides scalable opportunities to practice academic Q&A without intensive instructor involvement.
The second study focuses on feedback after academic Q&A sessions. We propose a BI-R framework that extends the Belief–Desire–Intention (BDI) model by explicitly incorporating Respect as a guiding principle for feedback generation. Experimental results suggest that while deep mental-state reasoning alone does not always outperform baseline approaches, feedback that embeds social sensitivity can be particularly effective for certain question types and learner characteristics.
The third study, LangInLab, explores situated interaction in engineering education by integrating vision- and voice-enabled AI agents into laboratory classes. Through role-based multimodal interaction, students practice technical English within authentic experimental contexts.
Together, these case studies illustrate how carefully designed LLM-based interactions can enhance academic language learning across diverse educational settings.
Keywords:
LLM-based Interaction, Spoken Academic Communication, Academic Language Learning
References (for the three case studies)
Aiba, M., Saito, D., & Minematsu, N. (2025). GPT-based simulation of oral Q&A to support students attending first conference. JALTCALL Trends, 1(1), 2163. https://doi.org/10.29140/jct.v1n1.2163
Aiba, M., Saito, D., & Minematsu, N. (2026). Incorporating respect into LLM-based academic feedback: A BI-R framework for instructing students after Q&A sessions. Proc. IWSDS (to appear).
Shigi, M., Rackauckas, Z., Akiyama, Y., & Minematsu, N. (2025). LangInLab: Augmenting engineering lab instruction with vision- and voice-enabled AI agents for language learning. Proc. Human-Agent Interaction 2025.
13:00-13:30 Discussion
13:30-15:00 🍴 LUNCH 🍴 (Provided, please register)
Session n°6: LLM-Based Assessment
15:00-15:30 Nicolas Ballier
Univ. Paris Cité, France
View presentation slides
What's in the L2 speech signal? Calibrating Whisper probability scores with phonetic posteriorgrams
This paper presents a method to automatically compare the probabilities assigned by Whisper, which can be used for L2 speech scoring (Ballier et al., 2024), with the phoneme distribution probabilities assigned by phonetic posteriorgrams (PPGs; Morrison et al., 2024).
Whisper is a foundational speech model trained on more than 90 languages to transcribe speech into text. It can be used to predict the spoken language and to perform automatic speech recognition (ASR). A standard metric for the quality of a transcription is the word error rate (WER), which computes the distance between the Whisper transcription and the text actually pronounced. This requires a reference transcription, so we investigate a textless method based on the acoustic prediction of what is actually in the signal, using phonetic posteriorgrams (PPGs). A PPG (Morrison et al., 2024) is a time-based categorical distribution over acoustic units of speech (usually phonemes). This type of representation has been used to disentangle pronunciation features (Churchwell et al., 2024; Morrison et al., 2024) and to provide an interpretable representation in terms of phone categories. Several models have been trained on the TIMIT dataset to produce PPGs. The assumption is that the signal can be interpreted in terms of phonemic realization, depending on the type of categories used in the training data (typically IPA symbols or TIMIT transcription conventions). For example, in the ppgs library (Churchwell et al., 2024), 42 categories are used, and for a given portion of the speech signal a probability over these 42 categories can be assigned to the phone realization. Usually a top-K method is applied, and the highest probability corresponds to the phone actually predicted.
We compare these phone probabilities assigned by posteriorgrams with the Whisper probabilities at token level (Liang et al., 2025; Ballier et al., 2024) obtained for the Whisper transcriptions. (Whisper transcriptions rely on a tokenization corresponding to a specific algorithm; see Ballier et al., 2024 for details.)
This paper therefore aims at comparing the Whisper predictions at token level (syllables, pseudo-syllables, or words) with the corresponding interpretation in terms of the probability of the phonemic distributions of the corresponding portion of the signal. We discuss the alignment issues, the different sizes of the time frames, and the implementation methods. Several implementations of the posteriorgrams exist with a 20-millisecond time frame, which means that the probability distribution is assigned to a 20-millisecond portion of the signal. We discuss the possible methods that can be used when the segment to be analyzed is longer than 20 milliseconds. We report preliminary investigations on the ISLE data.
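As a small illustration of the Whisper side of this comparison, the open-source openai-whisper package exposes word-level probabilities when word timestamps are requested (a coarser granularity than the tokenizer-level probabilities the authors work with; library and parameter choices here are assumptions for illustration):

```python
import whisper

model = whisper.load_model("small")

def word_level_probabilities(wav_path):
    """Return (word, start, end, probability) tuples from a Whisper pass.
    Each word's [start, end] span could then be mapped onto the 20 ms
    PPG frames covering the same stretch of signal for comparison."""
    result = model.transcribe(wav_path, word_timestamps=True)
    return [
        (w["word"].strip(), w["start"], w["end"], w["probability"])
        for segment in result["segments"]
        for w in segment["words"]
    ]
```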
15:30-16:00 Stefano Bannò
Univ. of Cambridge, UK
View presentation slides
Natural Language-based Assessment of L2 Oral Proficiency using LLMs
Natural language-based assessment (NLA) is an approach to second language assessment that uses instructions, expressed in the form of can-do descriptors and originally intended for human examiners, to determine whether large language models (LLMs) can interpret and apply them in ways comparable to human assessment. In this work, we explore the use of such descriptors with an open-source LLM, Qwen 2.5 72B, to assess responses from the publicly available S&I Corpus in a zero-shot setting. Our results show that this approach, relying solely on textual information, achieves competitive performance: while it does not outperform state-of-the-art speech LLMs fine-tuned for the task, it surpasses a BERT-based model trained specifically for this purpose. NLA proves particularly effective in mismatched task settings, is generalisable to other data types and languages, and offers greater interpretability, as it is grounded in clearly explainable, widely applicable language descriptors.
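As a sketch of what zero-shot NLA prompting can look like (the descriptor text is paraphrased and the model call simplified; the actual prompts, scoring scheme, and evaluation setup are the paper's):

```python
from transformers import pipeline

# The paper uses Qwen 2.5 72B; a smaller instruction-tuned sibling is used
# here purely so the sketch is runnable on modest hardware.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

# Paraphrased can-do descriptor, for illustration only.
DESCRIPTOR = ("B2: Can give clear, detailed descriptions on a wide range of "
              "subjects and develop an argument, highlighting significant points.")

def nla_judgement(transcript):
    """Ask the LLM to judge a transcribed L2 response against a descriptor."""
    messages = [{
        "role": "user",
        "content": (f"Descriptor: {DESCRIPTOR}\n"
                    f"Transcribed learner response: {transcript}\n"
                    "Does the response meet this descriptor? Answer yes or no, "
                    "then justify in one sentence."),
    }]
    output = generator(messages, max_new_tokens=80)
    return output[0]["generated_text"][-1]["content"]
```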
16:00-16:30 Luis Da Costa & Laura Rupp
Vrije Univ. Amsterdam, the Netherlands
View presentation slides
Perspectives for Large-Scale Speaking Assessment within the MOOC “English Pronunciation in a Global World”
In this talk, we present ongoing work on building a learner corpus of spoken English derived from a large Massive Open Online Course (MOOC), English Pronunciation in a Global World. We describe our design choices and annotation strategies, highlighting key insights gained from collecting and analyzing learner speech data at large scale. In the second part of the talk, we discuss experiments using Automatic Speech Recognition (ASR) models to support speech assessment. These experiments explore zero-shot ASR performance, ensemble approaches that combine multiple models, and continued pretraining to enhance accuracy and robustness. Together, these efforts aim to advance scalable, data-driven approaches to spoken language learning and assessment.
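The basic metric behind the zero-shot and ensemble ASR experiments is word error rate against the corpus's manual transcripts. A minimal sketch using the jiwer package follows (the text normalization here is a crude assumption, not the project's pipeline):

```python
import jiwer

def normalize(text):
    """Crude normalization so punctuation and casing don't inflate WER."""
    text = text.lower()
    for ch in ",.?!;:":
        text = text.replace(ch, "")
    return " ".join(text.split())

def corpus_wer(references, hypotheses):
    """Corpus-level word error rate of an ASR system's outputs against
    manual reference transcripts."""
    refs = [normalize(r) for r in references]
    hyps = [normalize(h) for h in hypotheses]
    return jiwer.wer(refs, hyps)

# Example: corpus_wer(["the cat sat"], ["the cat sad"]) -> 0.333...
```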
16:30-17:00 Discussion
17:00-17:30 CLOSING (Sylvain Coulange, Univ. Grenoble Alpes, France)
Printable version of the program
Contact
Sylvain Coulange
sylvain.coulange[at]univ-grenoble-alpes.fr



