Motivation: Conversational agents offer promising opportunities for education, as they can fulfill various roles (e.g., intelligent tutors and service-oriented assistants) and pursue different objectives (e.g., improving student skills and increasing instructional efficiency); among these, serving as an AI tutor is one of the most prevalent. Recent advances in Large Language Models (LLMs) provide our field with promising ways of building AI-based conversational tutors that can generate human-sounding dialogue on the fly. The key question posed in previous research, however, remains: how can we test whether state-of-the-art generative models are good AI teachers, capable of replying to a student in an educational dialogue?

In this shared task, we focus on educational dialogues between a student and a tutor in the mathematical domain that are grounded in student mistakes or confusion, where the AI tutor aims to remediate these mistakes. The goal is to evaluate the quality of tutor responses along four key dimensions of the tutor's ability to (1) identify the student's mistake, (2) point to its location, (3) provide relevant pedagogical guidance, and (4) make that guidance actionable. The dialogues used in this shared task comprise dialogue contexts from the MathDial (Macina et al., 2023) and Bridge (Wang et al., 2024) datasets, each ending with a student utterance that contains a mistake, together with a set of responses to that utterance from a range of LLM-based tutors and, where available, human tutors; these responses are aimed at mistake remediation and are annotated for their quality.
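To make the data description above concrete, the following is a minimal sketch of how one item might be represented. All field names, tutor identifiers, and label values shown here are illustrative assumptions for exposition, not the official release format.

```python
# Illustrative sketch of one shared-task item (field names, tutor identifiers,
# and label values are assumptions for exposition, not the official format).
example_item = {
    "conversation_id": "example_0001",  # hypothetical identifier
    "conversation_history": [
        {"speaker": "tutor",   "text": "Can you walk me through your solution?"},
        {"speaker": "student", "text": "I added 3/4 and 1/2 and got 4/6."},  # last utterance contains a mistake
    ],
    "tutor_responses": {
        "Tutor_A": {  # anonymized tutor name (assumed)
            "response": "Let's check that step: can 3/4 and 1/2 be added "
                        "directly, or do they need a common denominator first?",
            "annotation": {  # the four evaluation dimensions described above
                "Mistake_Identification": "Yes",
                "Mistake_Location": "Yes",
                "Providing_Guidance": "Yes",
                "Actionability": "Yes",
            },
        },
    },
}
```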

Tracks: This shared task will include five tracks. Participating teams are welcome to take part in any number of tracks.

  • Track 1 - Mistake Identification: Participants are invited to develop systems to detect whether tutors’ responses recognize mistakes in students’ solutions (a minimal baseline sketch for this track follows the list).
  • Track 2 - Mistake Location: Participants are invited to develop systems to assess whether tutors’ responses accurately point to genuine mistakes and their locations in the students’ responses.
  • Track 3 - Pedagogical Guidance: Participants are invited to develop systems to evaluate whether tutors’ responses offer correct and relevant guidance, such as an explanation, elaboration, hint, or example.
  • Track 4 - Actionability: Participants are invited to develop systems to assess whether tutors’ feedback is actionable, i.e., whether it makes clear what the student should do next.
  • Track 5 - Guess the Tutor Identity: Participants are invited to develop systems to identify which tutors the anonymized responses in the test set originated from.
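As a starting point, here is a minimal baseline sketch for Track 1 (Mistake Identification), framed as text classification over tutor responses. The label set, data loading, and evaluation metric are assumptions made for illustration only; the same pattern applies to Tracks 2–4 by swapping in the corresponding annotation field.

```python
"""Minimal baseline sketch for Track 1 (Mistake Identification).

Assumptions (not official task details): each example pairs a tutor response
(optionally concatenated with its dialogue context) with a label such as
"Yes", "To some extent", or "No".
"""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score


def train_and_evaluate(train_texts, train_labels, dev_texts, dev_labels):
    # Bag-of-words baseline: TF-IDF features + multinomial logistic regression.
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    model.fit(train_texts, train_labels)
    predictions = model.predict(dev_texts)
    # Macro-F1 is a reasonable choice for imbalanced multi-class labels
    # (the official evaluation metric may differ).
    return model, f1_score(dev_labels, predictions, average="macro")


if __name__ == "__main__":
    # Toy data, purely illustrative.
    texts = [
        "You added the fractions without a common denominator.",
        "Great job, that looks correct to me!",
    ] * 10
    labels = ["Yes", "No"] * 10
    _, macro_f1 = train_and_evaluate(texts, labels, texts, labels)
    print(f"Dev macro-F1: {macro_f1:.3f}")
```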

Participant registration: All participants should register using the following link.

Important dates:

  • March 12, 2025: Development data release
  • April 9, 2025: Test data release
  • April 23, 2025: System submissions from teams due
  • April 30, 2025: Evaluation of the results by the organizers
  • May 21, 2025: System papers due
  • May 28, 2025: Paper reviews returned
  • June 9, 2025: Final camera-ready submissions
  • July 31 and August 1, 2025: BEA 2025 workshop at ACL

Shared task website: https://sig-edu.org/sharedtask/2025  

Organizers:

  • Ekaterina Kochmar (MBZUAI)
  • Kaushal Kumar Maurya (MBZUAI)
  • Kseniia Petukhova (MBZUAI)
  • KV Aditya Srivatsa (MBZUAI)
  • Justin Vasselli (Nara Institute of Science and Technology)
  • Anaïs Tack (KU Leuven)

Contact: bea.sharedtask.2025@gmail.com