Abstracts Track 2026


Area 1 - Artificial Intelligence

Nr: 726
Title:

A Thread-Aware Framework for Enhancing Conversational Agents in Multi-Party Group Chat

Authors:

Jiacheng Li, Chenhui Wang, Yunao Zheng, Fangfang Yang and Zedong Hao

Abstract: Large Language Model (LLM) based conversational agents face unique challenges in multi-party group chat environments. Unlike dyadic conversations, group chats exhibit interleaved discussion threads, fragmented context, and high noise levels that degrade agent comprehension and task performance. We propose a generalizable framework that enhances agent capabilities in group chat through two complementary mechanisms: Thread Graph for context focusing and feedback-driven Retrieval-Augmented Generation (RAG) for intent understanding. Thread Graph addresses context fragmentation by clustering messages into semantically coherent conversation threads. The mechanism employs a three-tier edge decision strategy: strong links leverage explicit structural signals (reply chains, @-mentions) for definitive clustering; hard breaks identify topic discontinuities through low semantic similarity combined with participant disjointness; gray zones batch ambiguous cases for hybrid LLM-rule disambiguation. Temporal weighting adjusts for natural communication pauses during non-working hours. By providing thread-focused context rather than raw message streams, agents receive semantically relevant input while reducing noise and token consumption. The RAG component enables continuous improvement through human feedback. Historical intent extractions with user corrections are embedded and indexed. During inference, semantically similar examples are retrieved as few-shot demonstrations, enabling agents to learn domain-specific patterns and avoid previously identified errors. The framework is task-agnostic and supports configurable intent types through declarative specifications. We evaluate the framework on bug detection in enterprise group chat—a representative task requiring agents to identify bug-related discussions and extract structured information (e.g., reporter and confirmer). 
Experiments on labeled production chat data across varying context window sizes demonstrate that the proposed thread-aware approach achieves a 24% relative improvement in precision over fixed-window baselines. Notably, performance remains stable as context windows expand, whereas baseline precision degrades significantly under increasing contextual noise. Our framework demonstrates the effectiveness of thread-aware context management for conversational agents in complex multi-party chat environments and highlights its potential applicability to a broad range of enterprise tasks.
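The three-tier edge decision described above can be sketched as follows. This is a minimal illustration only: the field names, similarity threshold, and omission of temporal weighting are our assumptions, not details taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Message:
    sender: str
    reply_to: Optional[int] = None               # index of the message replied to, if any
    mentions: set = field(default_factory=set)   # users @-mentioned in this message
    participants: set = field(default_factory=set)

def edge_decision(prev, curr, similarity, sim_low=0.3):
    """Three-tier edge decision between two messages.

    Returns 'strong' (explicit structural signal), 'break' (topic
    discontinuity), or 'gray' (deferred to hybrid LLM-rule
    disambiguation). Threshold value is illustrative.
    """
    # Strong link: explicit reply chain or @-mention of the previous sender.
    if curr.reply_to is not None or prev.sender in curr.mentions:
        return "strong"
    # Hard break: low semantic similarity combined with participant disjointness.
    if similarity < sim_low and prev.participants.isdisjoint(curr.participants):
        return "break"
    # Ambiguous: batched for hybrid LLM-rule disambiguation.
    return "gray"
```

In a full implementation, temporal weighting would additionally rescale the similarity term to discount natural pauses during non-working hours.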

Nr: 727
Title:

Binaural Audio Generation Using Diffusion Models Conditioned on Visual and Positional Features of Sound Sources

Authors:

Haruka Okano, Ryohei Orihara, Yasuyuki Tahara, Akihiko Ohsuga and Yuichi Sei

Abstract: Binaural audio reproduces how sound reaches the left and right ears, enabling listeners to perceive spatial cues such as direction and stereo width. This enhances immersion in music, games, and video. However, it typically requires specialized microphones, which are costly and inconvenient. An alternative is a deep learning-based approach that generates binaural audio from monaural recordings by using the accompanying video as a cue. However, most prior work has only been evaluated on pseudo-monaural inputs derived from binaural recordings, due to the lack of datasets with real monaural recordings paired with binaural ground truth. Since such inputs still contain residual spatial cues, it remains unverified whether these methods can be applied to real monaural recordings, which lack spatial information. Our goal is to address this practical setting and to propose a method that reproduces realistic binaural effects under such conditions. To this end, we introduce a new dataset and model. First, we construct RealBinaural, an audio-visual dataset that synchronously records monaural audio and video with a smartphone, and binaural audio using an in-ear binaural microphone. The dataset contains 2,344 10-s clips, covering eight sound classes, each recorded at five positions. Second, we propose DiffBinaural, a conditional diffusion model for binaural audio generation from monaural and visual cues. It predicts left/right mel-spectrograms, which are converted into binaural waveforms by a vocoder. To achieve this, DiffBinaural incorporates a Position-Aware Vision Feature (PAVF) module, which encodes visual information for each sound source by combining its semantic and positional features. This allows the model to generate binaural audio that reflects the positions of sound sources in the visual scene, even in complex scenes with multiple sources. 
We evaluated the proposed method on the FAIR-Play benchmark with pseudo-monaural inputs and on the RealBinaural dataset with real monaural recordings. The baselines worked on pseudo-monaural inputs but completely failed on real monaural recordings: listeners could no longer judge the direction of the sound source from the generated audio, resulting in a localization accuracy of 0%. In contrast, our method reduces mel-spectrogram reconstruction error by about 40% compared with the baselines and improves sound-source localization accuracy to nearly 30%. We further analyzed the impact of different input conditions on the RealBinaural dataset. As a spatial metric, we use an Interaural Cross-Correlation (IACC)-based error that measures how far the left–right similarity of the generated audio deviates from the ground-truth binaural audio. For the baselines, switching from pseudo-monaural to real-monaural inputs causes a clear degradation: the mel-spectrogram error increases from about 0.55 to 0.95, and the IACC error also worsens from about 0.25 to 0.29, confirming the difficulty of dealing with real-monaural input. In contrast, our method shows a smaller increase in mel-spectrogram error, from 0.41 to 0.59, while the IACC error remains almost unchanged, from 0.167 to 0.168. This indicates that our method is more robust under practical conditions. This study reveals the limitations of conventional methods in real-world settings and offers a practical solution for enriching ordinary videos with immersive spatial audio experiences. The full version of this work has been submitted to Archives of Acoustics.
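The IACC-based spatial metric used above can be illustrated with a simplified sketch. This is our own toy formulation: a real IACC computation bounds the lag in milliseconds and may band-pass the signals first, and the exact metric used in the paper may differ.

```python
import math

def iacc(left, right, max_lag=4):
    """Simplified interaural cross-correlation: the maximum normalized
    cross-correlation between left/right channels over a small lag window."""
    def ncc(lag):
        pairs = [(left[i], right[i + lag]) for i in range(len(left))
                 if 0 <= i + lag < len(right)]
        num = sum(l * r for l, r in pairs)
        den = math.sqrt(sum(l * l for l, _ in pairs) * sum(r * r for _, r in pairs))
        return num / den if den else 0.0
    return max(ncc(lag) for lag in range(-max_lag, max_lag + 1))

def iacc_error(gen_left, gen_right, ref_left, ref_right):
    """How far the generated audio's left-right similarity deviates from
    that of the ground-truth binaural audio."""
    return abs(iacc(gen_left, gen_right) - iacc(ref_left, ref_right))
```

Identical left and right channels give an IACC of 1.0 (no spatial separation), so a generated clip whose IACC tracks the ground truth yields an error near zero.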

Nr: 729
Title:

Addressing Prompt Injection in Large Language Models via In-Context Learning

Authors:

Go Sato, Shusaku Egami, Yasuyuki Tahara, Akihiko Ohsuga and Yuichi Sei

Abstract: Despite the rapid advancement of Large Language Models (LLMs), security attacks such as prompt injection and jailbreaking remain critical challenges. Although numerous methods have been proposed to address these issues, existing approaches face two significant limitations. First, benign prompts containing sensitive keywords are often erroneously rejected. Since many existing methods focus primarily on detecting harmful prompts, they often trigger undesirable over-refusal. Second, most existing datasets of benign prompts consist largely of easily distinguishable examples. Consequently, machine learning–based approaches face an upper bound on accuracy when classifying difficult cases. To address these challenges, we propose a novel Multi-LLM Agent Framework consisting of Analysis and Generation Teams that improve performance through interaction-based learning. For a given prompt, the Analysis Team dynamically selects the number of agents and their tasks, and evaluates harmfulness from multiple perspectives. Meanwhile, the Generation Team, employed to augment the dataset, generates new prompts that are more difficult to classify based on the given prompt. Providing these generated prompts to the Analysis Team both refines its judgment and helps address the aforementioned challenges. We employ In-Context Learning (ICL), providing each team with logs generated by its agents. The Analysis Team log accumulates prompts previously misclassified and corresponding improvement points, whereas the Generation Team log accumulates prompts correctly classified by the Analysis Team and guidelines for generating prompts that induce errors. Alongside log accumulation, extracting a subset of logs similar to the given prompt allows for adaptive accuracy refinement via ICL. Because ICL does not require the immense training data of conventional fine-tuning, this study targets unaligned LLMs that require immediate defense measures. 
For the experiments, we used three LLMs: Gemma3 and Qwen2.5, which are the targets of this method, and Llama3.1, which has robust alignment, as a reference. To comprehensively evaluate the efficacy of harmfulness determination, we evaluated binary classification performance using macro-averaged metrics. Comparing results against undefended models and three prior studies using 1,534 prompts across five datasets (DAN, JBB-Behaviors, SAP200, XSTest, and OKTest), the proposed method recorded the highest scores on the target LLMs. Specifically, the F1-score improved by 16.6 points on average compared to the undefended models and exceeded the strongest baseline by 5.68 points on average. For Llama3.1, the proposed method also outperformed all prior studies and achieved parity with the undefended model. These results demonstrated that the proposed method maintains consistent classification quality without hindering the inherent defense mechanisms or judgment criteria, even for LLMs with robust safety measures. The significance of this study lies in proposing a novel approach based on ICL that does not involve parameter updates. This enables the adaptive enhancement of judgment quality solely through log accumulation without retraining the LLM itself against evolving attack methods, ensuring high sustainability in practical operation. It mitigates the trade-off between safety and utility, contributing to the implementation of robust LLMs. The full version of this work will be submitted to Computers, Materials & Continua (CMC).
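The log-retrieval step behind the ICL mechanism can be sketched as below. For self-containment this toy uses token-set Jaccard similarity in place of the embedding-based similarity the paper implies; the log entry fields are illustrative.

```python
def jaccard(a, b):
    """Token-set similarity standing in for an embedding-based measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def retrieve_demonstrations(prompt, log, k=2):
    """Select the k logged entries most similar to the incoming prompt,
    to be prepended as few-shot in-context demonstrations."""
    ranked = sorted(log, key=lambda e: jaccard(prompt, e["prompt"]), reverse=True)
    return ranked[:k]
```

Because the log only grows and retrieval adapts per prompt, judgment quality can improve over time without any parameter updates to the underlying LLM.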

Nr: 734
Title:

Generation of Semantic Segmentation Masks Based on Conditional Discrete Flow Matching with a Discriminative Model

Authors:

Masahiro Fujioka, Kaiyu Suzuki and Ichiro Matsuda

Abstract: Semantic segmentation is a pixel-wise classification task that assigns a semantic category to each pixel, and it is widely used in applications such as autonomous driving and medical image analysis. Recently, there has been growing interest in leveraging the strong representational capacity of generative models, including diffusion models, to tackle semantic segmentation generation. In our prior work [1], we proposed Conditional Discrete Flow Matching (CDFM), a conditional extension of Discrete Flow Matching [2] that directly generates discrete per-pixel semantic categories, and demonstrated competitive performance relative to conventional discriminative models. One key finding from that study was that, compared with discriminative model-based approaches, CDFM produces segmentation masks with sharper object boundaries, while exhibiting a tendency toward reduced accuracy in class-label prediction. This trade-off between boundary accuracy and label discrimination has also been reported for diffusion-based segmentation methods [3], suggesting that CDFM exhibits behaviors shared with diffusion-based approaches. To address this limitation, we investigate incorporating conditioning strategies originally developed for diffusion-based models into CDFM. Specifically, we employ a pretrained discriminative semantic segmentation model to produce coarse semantic predictions from an input image and incorporate its outputs as conditioning information for CDFM. This design guides the generation process by the global class layout and semantic consistency provided by the discriminative model, while the stochastic sequential generation of CDFM progressively refines high-frequency components near object boundaries. We validate the proposed method through semantic segmentation experiments on the Cityscapes dataset, which focuses on urban street scenes. 
Using UPerNet as the pretrained discriminative model and conditioning CDFM on its outputs, we obtain improvements over the conventional CDFM baseline, with macro-mIoU increasing from 0.511 to 0.532 and micro-mIoU improving from 0.852 to 0.862. These results indicate that leveraging the label prediction capability of a discriminative model as conditioning information can mitigate CDFM’s weakness in label discrimination while preserving the strong boundary representation capacity of generative modeling. In summary, our contributions are twofold. First, we demonstrate that conditioning strategies originally proposed for diffusion models can be effectively adapted to the DFM-based CDFM framework. Second, we empirically show that our approach effectively reconciles boundary accuracy and label prediction accuracy in semantic segmentation. Future work includes further improving performance through architectural refinements of both CDFM and the discriminative conditioning model, as well as designing optimal DFM probability paths tailored to semantic segmentation generation. References. [1] M. Fujioka, et al., "Generation of Semantic Segmentation Masks Based on Conditional Discrete Flow Matching," Proc. IWAIT 2026, Jan. 2026 (to appear). [2] I. Gat, et al., "Discrete Flow Matching," Proc. NeurIPS 2024, Dec. 2024. [3] H. Wang, et al., "A Gift from the Integration of Discriminative and Diffusion-based Generative Learning: Boundary Refinement Remote Sensing Semantic Segmentation," CoRR, abs/2507.01573, Jul. 2025.

Nr: 735
Title:

A Model-Free Personalized Federated Learning Framework for Fair and Communication-Efficient Coordination of Heterogeneous Agents

Authors:

Nanae Kaneko, Yu Fujimoto, So Takahashi, Akihisa Kaneko, Jun Yoshinaga, Yutaka Iino and Yasuhiro Hayashi

Abstract: Distributed control systems increasingly rely on large populations of heterogeneous agents that must adapt operational parameters under strong mutual coupling, limited communication, and non-independent and identically distributed local observations. In such environments, local actions can have nontrivial system-wide impacts, making purely decentralized optimization ineffective, while fully centralized approaches often suffer from scalability, privacy, and robustness issues. Although federated learning provides a scalable coordination paradigm (Zhang et al., 2021), many existing approaches rely on explicit system models, centralized gradient aggregation, or stationary local objectives, which restricts their applicability to real-world cyber-physical systems. This study proposes a model-free personalized federated learning framework for adaptive parameter optimization in distributed agent networks (Fujimoto et al., 2024). The framework combines perturbation-based local sensitivity estimation (Spall, 1997) with cluster-wise knowledge sharing among agents operating under similar conditions. To enable communication-efficient coordination, each agent locally compresses high-frequency observations into a low-dimensional set of sufficient statistics that captures the joint behavior of control actions and system responses. These compact representations are shared with a coordinator, which harmonizes update directions across clusters without accessing raw data or relying on explicit physical models. The proposed framework adopts a fairness-aware objective formulation based on a distance measure that jointly captures overall performance degradation and inequality among agents, enabling balanced optimization under heterogeneous constraints. Unlike independent local learning, the method leverages shared exploration results to stabilize gradient estimation, while avoiding the scalability and privacy limitations of fully centralized schemes. 
The effectiveness of the proposed approach is demonstrated through large-scale simulations involving several thousand interacting agents. As a representative application, the framework is evaluated on a distributed and coordinated voltage regulation problem in power distribution networks, illustrating its practicality in complex cyber-physical systems. The results show more stable learning dynamics, reduced performance inequality, and lower communication overhead compared to independent and centralized baselines, indicating that the proposed framework provides a practical and generalizable solution for distributed, model-free optimization in complex multi-agent systems. This work was supported by JSPS KAKENHI Grant Number JP23H00190. References. Fujimoto, Y., et al. (2024). A Personalized Federated Learning Scheme for Operational Parameter Determination of PV Smart Inverters. In 13th International Conference on Renewable Energy Research and Applications, pp. 475–480. IEEE. Spall, J.C. (1997). A One-Measurement Form of Simultaneous Perturbation Stochastic Approximation. Automatica, Vol. 33, No. 1, pp. 109–113. Elsevier. Zhang, C., et al. (2021). A Survey on Federated Learning. Knowledge-Based Systems, Vol. 216, 106775. Elsevier.
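The one-measurement perturbation idea cited above (Spall, 1997) can be sketched in a few lines. The gain values and the plain-list representation are our illustrative choices; in practice the gains decay over iterations and the loss is measured on the physical system.

```python
import random

def spsa_one_measurement_step(theta, loss, a=0.05, c=0.1):
    """One-measurement SPSA: a single perturbed loss evaluation yields a
    stochastic gradient estimate for every coordinate at once, so no
    explicit system model or per-coordinate probing is required."""
    delta = [random.choice([-1.0, 1.0]) for _ in theta]   # Rademacher perturbation
    y = loss([t + c * d for t, d in zip(theta, delta)])   # the single measurement
    grad = [y / (c * d) for d in delta]                   # per-coordinate estimate
    return [t - a * g for t, g in zip(theta, grad)]
```

Sharing such perturbation results within a cluster of similar agents, as the framework proposes, averages out the noise of these single-measurement estimates.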

Nr: 733
Title:

Sketch DiffEditor: Prompt-Free Image Manipulation Method Only with Partial Sketches

Authors:

Taketo Sasaki, Yasuyuki Tahara, Yuichi Sei and Akihiko Ohsuga

Abstract: Image manipulation allows users to create new content that meets their needs by modifying parts of an image based on user-defined conditions. Research on image manipulation methods using deep learning has been actively conducted to support efficient creative activities and advertising production. In particular, approaches that use sketches, which represent geometric information such as the contours and shapes of objects using lines, as input conditions are ideal for editing local image structures. However, most existing sketch-based methods require a mask as an additional user input to explicitly define the editable region. While mask-based image manipulation simplifies problem formulation, it imposes an extra burden on users by requiring manual mask creation. To address this issue, mask-free methods have been proposed that integrate editable region estimation into the model. However, most of these methods are based on GANs, and the quality of the generated images is inferior to that of recent generative models such as latent diffusion models. Moreover, most existing studies on sketch-based image manipulation using latent diffusion models require textual prompts to control the generated images, and unlike masks, only a few studies attempt to omit the prompt creation process. To address these problems, we propose Sketch DiffEditor, a conditional latent diffusion model for sketch-based image manipulation. Unlike conventional approaches, our method requires users to provide only the image to be edited and a sketch indicating the desired modification. The model first generates a mask of the editable region using a mask estimation network. This network employs a branched decoder that also produces a rough edited image. Next, the input image is concealed according to the estimated mask, and a pre-trained latent diffusion model is applied to inpaint the masked region. 
Instead of using textual prompts, our method conditions the latent diffusion model on the roughly edited image generated in the previous step. We further introduce a trainable image transformation module that extracts information necessary for inpainting while resizing the image to meet the input requirements of the latent diffusion model. This design enables minimal changes to the visual content of the input image, even when only limited sketch information is available, and allows the editing results to better reflect the user’s intent. In addition, we propose a new dataset construction pipeline. Unlike conventional approaches based on free-form deformation, our method uses flow maps to generate paired training images. During the flow map computation process, corresponding training masks are obtained and used to train the mask estimation network, thereby improving its estimation accuracy. We evaluate the proposed method on the Places2 and Landscape Pictures datasets. A baseline method that performs image editing using only sketches tends to overestimate editable regions, resulting in noisy outputs. In contrast, our method reduces FID by approximately 27.8% and improves LPIPS by approximately 9.9% compared to the baseline. Furthermore, comparisons with a prompt-based model demonstrate that our approach achieves comparable performance. This study offers a solution for more interactive image manipulation by reducing the number of user operations required. The experimental results further suggest that textual prompts may have a limited impact on sketch-based image manipulation.
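The two-stage pipeline described above can be summarized as a sketch in which the trained networks are stand-in callables (the function names and the None-masking convention are our assumptions):

```python
def sketch_edit(image, sketch, estimate_mask, inpaint):
    """Prompt-free editing pipeline: the branched mask-estimation network
    yields both the editable-region mask and a rough edited image; the
    masked input is then inpainted conditioned on that rough edit
    instead of a text prompt."""
    mask, rough = estimate_mask(image, sketch)
    # Conceal the input image wherever the estimated mask is set.
    concealed = [[None if m else px for px, m in zip(prow, mrow)]
                 for prow, mrow in zip(image, mask)]
    # A pre-trained latent diffusion model inpaints the concealed region.
    return inpaint(concealed, condition=rough)
```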

Nr: 739
Title:

Securing Spiking Neural Networks Against Spiking Universal Adversarial Attacks

Authors:

Soukaina Aji, Pierre Boulet and Ihsen Alouani

Abstract: Spiking Neural Networks (SNNs) are the biologically inspired third generation of Artificial Neural Networks (ANNs), mimicking the behavior of biological neurons by transmitting information through discrete spikes over time [1]. Although SNNs are often considered more robust than conventional ANNs, recent studies have shown that they remain vulnerable to adversarial attacks that rely on small input-specific perturbations to alter predicted labels [2]. For years, ANNs were targeted by Universal Adversarial Attacks (UAAs), in which a single perturbation is applied across all inputs to induce misclassification [3]. Recently, SNNs were shown to be susceptible to Spiking Universal Adversarial Attacks (SUAAs) [4], a spiking version of UAAs. These attacks exploit event-based data, which are inherently sparse in both space and time and are typically captured using Dynamic Vision Sensors (DVS). One solution to secure ANNs against UAAs is adversarial training, i.e., training a model on adversarially perturbed examples to improve robustness against attacks such as Projected Gradient Descent (PGD) [5]. In this work, we study the effectiveness of standard universal adversarial defense strategies against SUAAs and demonstrate that adversarial training based on attacks such as PGD is effective against input-specific adversarial examples, but fails to improve robustness against spike-based attacks, particularly universal ones. To address this limitation, we propose, to the best of our knowledge, the first universal adversarial training framework specifically designed for SNNs to enhance robustness against SUAAs. We evaluate our approach on SNNs composed of Leaky Integrate-and-Fire (LIF) neurons using neuromorphic datasets, including N-MNIST and IBM DVS-Gesture. Experimental results show that SNNs trained with spiking universal adversarial training achieve significantly improved robustness against SUAAs and per-input spike-based adversarial attacks. 
In particular, on N-MNIST, spiking universal adversarial training improves test accuracy under SUAA attack from 63.54% to 93.08% for a noise budget ε = 3 × 10⁻⁴, and from 19.43% to 92.20% for ε = 4 × 10⁻⁴, and on DVS-Gesture under a SUAA attack with ε = 8 × 10⁻⁴, the accuracy increases from 50.83% to 70.83%. Acknowledgements. Work supported by IRCICA (Univ. Lille, CNRS, USR 3380 – IRCICA, F-59000 Lille, France), Luxant Innovation, and the European Metropolis of Lille (MEL) under the Luxant-ANVI industrial chair. References. [1] H. Paugam-Moisy and S. Bohte, “Computing with spiking neuron networks,” in Handbook of Natural Computing. Springer, 2012, pp. 335–376. [2] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014. [3] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, “Universal adversarial perturbations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1765–1773. [4] S. Raptis and H.-G. Stratigopoulos, “Input-Specific and Universal Adversarial Attack Generation for Spiking Neural Networks in the Spiking Domain,” in International Joint Conference on Neural Networks (IJCNN), Rome, Italy, Jun. 2025. hal-05054528. [5] A. Shafahi, M. Najibi, Z. Xu, J. Dickerson, L. S. Davis, and T. Goldstein, “Universal adversarial training,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 5636–5643.
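The overall shape of universal adversarial training (after Shafahi et al. [5]) can be sketched as an alternating loop; the plain-list representation and the attack/train callables are our simplifications of what would be spike-domain tensor operations in the actual framework.

```python
def universal_adversarial_training(data, delta, attack_step, train_step,
                                   epochs=2, eps=0.5):
    """One shared perturbation delta is ascended to maximize the loss
    across all inputs, projected onto the eps noise budget, and the
    model is then trained on the perturbed inputs."""
    for _ in range(epochs):
        for x, y in data:
            delta = attack_step(x, y, delta)                 # ascend loss w.r.t. delta
            delta = [max(-eps, min(eps, d)) for d in delta]  # project onto the eps-ball
            train_step([xi + di for xi, di in zip(x, delta)], y)  # descend model loss
    return delta
```

The key difference from per-input adversarial training is that delta persists across all inputs, so the model learns robustness to one reusable perturbation rather than to input-specific ones.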

Nr: 741
Title:

An Autonomous Multi-Agent Architecture for Multimodal Reasoning in Video-Game Research Papers and Industry Trends

Authors:

Srishti Saha

Abstract: Understanding the psychological, cognitive, and societal impacts of video games requires integrating heterogeneous information sources, including empirical research, gameplay structures, design documentation, and real-time industry developments. Existing retrieval and RAG-based systems lack autonomous agentic capabilities for coordinating multi-source reasoning or linking gameplay events to human outcomes, limiting their ability to answer complex interdisciplinary questions such as how specific games affect cognitive outcomes in autistic players or how emerging technologies and economic trends shape modern game development. Recent advances in agentic AI have enabled retrieval-augmented reasoning systems capable of synthesizing multimodal knowledge at scale. However, current video-game analytics systems primarily rely on structured metadata or player telemetry and fail to integrate semantically rich sources such as research literature, game attribute taxonomies, game design documents, gameplay guides, and industry news. To address this gap, we propose an Agentic AI Knowledge Retrieval Framework that unifies diverse datasets through a multi-agent architecture grounded in Retrieval-Augmented Generation (RAG) and supported by a dynamic Video-Game Knowledge Graph (VG-KG). The framework consists of four core agents: (1) a Scientific Knowledge Agent that retrieves empirical evidence from research on psychological, cognitive, emotional, and social effects of video games; (2) a Gameplay Event Agent that extracts mechanics, dynamics, aesthetics (MDA), and moment-to-moment player interactions from walkthroughs; (3) a Game Design Document Agent that parses narrative, systems-level, and design-intent information directly from GDDs; and (4) a News and Event Extraction Agent that continuously transforms unstructured media articles into structured, time-stamped industry events linked to technological, economic, and pedagogical trends. 
These heterogeneous signals are fused into a unified knowledge graph connecting games, mechanics, player experiences, psychological outcomes, design features, and real-world industry developments. A Query Router Agent interprets user intent and routes subtasks to specialized retrieval agents, while a Reasoning Agent synthesizes retrieved evidence into concise, citation-aligned responses with expandable explanations. We evaluate the system on diverse user queries requiring multi-source and multi-hop evidence integration. Metrics include retrieval precision, reasoning coherence, and alignment across dataset types. Experimental results demonstrate that the agentic framework significantly outperforms baseline RAG models in evidence retrieval quality, cross-source inference, and temporal trend reasoning. Case studies illustrate the framework’s ability to address complex questions on cognitive impacts and industry evolution, highlighting its potential as a foundation for intelligent game analytics, educational technology evaluation, and future autonomous research assistants.
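The Query Router Agent's dispatch step can be illustrated with a toy keyword-based version. A real router would use an LLM to interpret intent; the agent names echo the four agents above, but the keyword sets and scoring are purely illustrative.

```python
AGENT_KEYWORDS = {
    "scientific": {"cognitive", "psychological", "effect", "outcome"},
    "gameplay":   {"mechanic", "boss", "level", "walkthrough"},
    "design_doc": {"narrative", "gdd", "design"},
    "news":       {"release", "market", "trend", "announcement"},
}

def route_query(query):
    """Score each specialized retrieval agent by keyword overlap with the
    query and dispatch the subtask to every matching agent, highest
    score first (ties keep declaration order)."""
    tokens = set(query.lower().replace("?", " ").split())
    scores = {name: len(tokens & kws) for name, kws in AGENT_KEYWORDS.items()}
    return [name for name, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s > 0]
```

A multi-hop question touching both research evidence and game mechanics would thus fan out to the Scientific Knowledge and Gameplay Event agents, whose results the Reasoning Agent then synthesizes.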

Area 2 - Agents

Nr: 730
Title:

Evolution of Collective Intelligence: A Comparative Study of Homogeneous and Heterogeneous LLM-Based Agent Societies

Authors:

Masatoshi Fujiyama, Ryohei Orihara, Yasuyuki Tahara, Akihiko Ohsuga and Yuichi Sei

Abstract: Recent advancements in Large Language Models (LLMs) have accelerated their use in Agent-Based Modeling to understand human-like social phenomena. However, most existing studies focus on homogeneous populations of agents based on a single LLM, leaving heterogeneous groups, consisting of agents based on a variety of LLMs, largely unexplored. Understanding collective characteristics in heterogeneous populations is essential for the safety and controllability of future autonomous AI systems. This study compares a homogeneous group of twenty Gemma3 agents, referred to as the Homo-Group, against a heterogeneous group consisting of ten Gemma3 and ten Qwen2.5 agents, referred to as the Hetero-Group, to reveal how population diversity influences behavioral evolution and social structure through collective simulation. We designed a 2D grid environment where agents compete for finite resources referred to as Food items. Agents possess Stamina, which is consumed at every time step. An additional fixed amount is consumed by movement, and Stamina is recovered by eating Food items. This constraint creates a survival trade-off, encouraging the manifestation of deep-seated biases. We conducted 5 independent 1000-step simulations for both the Homo-Group and the Hetero-Group. Results showed that group configuration significantly altered agent behavior. Notably, compared to the Homo-Group, Gemma3 agents in the Hetero-Group were more unselfish, actively giving Food items to others instead of stealing them. This shift suggests that the strong bias toward prioritizing others' survival observed in Qwen2.5 agents propagated to Gemma3 agents, steering the collective behavior toward unselfishness. Analysis of social structures further highlighted these differences. 
Network analysis revealed that the Homo-Group formed small, sparse subnetworks, with an average size of around 3.0 and a density of around 0.5, whereas the Hetero-Group evolved into larger, denser communities, with an average size of around 4.0 and a density of around 0.8. Additionally, communication analysis based on Relational Models Theory indicated that while Gemma3 agents in the Homo-Group favored hierarchical communication, once placed in the Hetero-Group, they developed egalitarian and communal norms. These structural differences stem from distinct information propagation processes. In the Homo-Group, agents relied on relayed secondary information, leading to a hierarchical structure centered on influential information sources. Conversely, the initial communication barrier between heterogeneous LLMs in the Hetero-Group compelled agents to seek primary information through direct contact. This necessity for direct interaction fostered a dense network and egalitarian social norms. In conclusion, this research demonstrates that agents in the Hetero-Group exhibit distinct behaviors depending on the LLMs with which their surrounding agents are configured, leading to the formation of collective characteristics beyond those observed in the Homo-Group. These results provide crucial insights into the controllability and robustness of advanced LLM-based multi-agent systems as their social implementation progresses. Furthermore, the findings suggest that heterogeneous populations may give rise to a new form of collective intelligence distinct from that emerging in homogeneous groups. We have submitted the full version to Simulation: Transactions of the Society for Modeling and Simulation International.

Nr: 742
Title:

A Debate Framework for LLM Agents Accounting for Interruptions and Silence

Authors:

Akikazu Kimura, Ken Fukuda, Yasuyuki Tahara and Yuichi Sei

Abstract: In recent years, collaborative reasoning with multiple large language models (LLMs) acting as agents has attracted attention for solving complex tasks through debate. However, many existing multi-agent debate frameworks rely on predetermined speaking orders and fail to reproduce the autonomous speaker switching observed in human conversation. Prior work has introduced more flexible turn-taking than fixed orders by having agents bid for the opportunity to speak and determining the speaking order based on those bids. Yet, few approaches explicitly address interruptions and silence, or provide mechanisms for immediate intervention when errors arise. As a result, explanations based on incorrect premises can unfold across multiple sentences, making the discussion vulnerable to being led by misinformation. To address this issue, we focus on interruptions and silence as observed in human dialogue and propose a debate framework that incorporates these mechanisms. The proposed framework consists of three phases: (1) initial answer generation, (2) multi-turn debate, and (3) final answer generation. In phase (1), each agent independently generates an answer, and all initial answers are shared with all agents. In phase (2), each message produced by the current speaker is split into sentences, and only one sentence is revealed per turn. All non-speaking agents output a next-turn action plan containing their thoughts, the next action chosen from listening, speaking, or interrupting, an urgency score, the purpose of the action, and their current answer. Among agents that propose speaking or interrupting, the agent with the highest urgency becomes the next speaker. If the current speaker still has unrevealed sentences, the switch is treated as an interruption. When urgency scores are low, agents may choose to listen, enabling them to deepen their understanding of the speaker’s utterance or yield the floor to other agents. 
This design enables autonomous debate in which agents can intervene immediately after incorrect premises or logical inconsistencies emerge. In phase (3), each agent regenerates an answer based on the debate transcript, and the final answer is determined by majority voting among agents. To evaluate the effectiveness of the proposed framework, we conducted experiments on the MMLU benchmark using three agents. We considered scenarios in which a majority of agents initially produced an incorrect answer while only a minority produced the correct answer, and compared final accuracy across three conditions: (1) fixed turn order, (2) dynamic turn-taking without interruptions, and (3) the proposed framework. We also performed comparative process analysis based on agents’ current answers at each turn. Experimental results showed that dynamic turn-taking improved accuracy compared to fixed turn order. Moreover, the proposed framework achieved an improvement of approximately eight percentage points over the dynamic turn-taking method without interruption capabilities. Process analysis further indicated that the proposed framework reduces the probability that the minority agent that was initially correct switches to an incorrect answer, suggesting that sentence-level action planning mitigates the influence of erroneous claims. This study demonstrates that interruptions and silence in LLM debates help prevent the propagation of incorrect premises, providing valuable insights for building reliable and robust multi-agent systems.
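The per-turn speaker-selection rule can be sketched as follows; the plan field names and the urgency threshold are illustrative, not taken from the paper.

```python
def next_speaker(current, remaining_sentences, plans, min_urgency=0.5):
    """Choose the next speaker from the non-speaking agents' action plans.

    An agent taking the floor while the current speaker still has
    unrevealed sentences counts as an interruption; if every agent
    listens (or urgency is low), the current speaker keeps the floor.
    """
    bids = [(p["urgency"], name) for name, p in plans.items()
            if p["action"] in ("speak", "interrupt") and p["urgency"] >= min_urgency]
    if not bids:
        return current, False          # everyone listens: floor is kept
    _, chosen = max(bids)              # highest urgency wins the floor
    return chosen, remaining_sentences > 0
```

Because only one sentence is revealed per turn, an agent detecting an incorrect premise can bid with high urgency and cut in before the flawed explanation unfolds further.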