# Focus Agent: LLM-Powered Virtual Focus Group
**Authors**: Taiyu Zhang, Xuesong Zhang, Robbe Cools, Adalberto L. Simeone
> 1234-5678-9012 KU Leuven Naamsestraat 22 Leuven Belgium 3001
> KU Leuven Naamsestraat 22 Leuven Belgium 3001
> KU Leuven Leuven Belgium
(2024)
## Abstract
In the domain of Human-Computer Interaction, focus groups represent a widely utilised yet resource-intensive methodology, often demanding the expertise of skilled moderators and meticulous preparatory efforts. This study introduces the “Focus Agent,” a Large Language Model (LLM) powered framework that simulates both the focus group (for data collection) and acts as a moderator in a focus group setting with human participants. To assess the data quality derived from the Focus Agent, we ran five focus group sessions with a total of 23 human participants as well as deploying the Focus Agent to simulate these discussions with AI participants. Quantitative analysis indicates that Focus Agent can generate opinions similar to those of human participants. Furthermore, the research exposes some improvements associated with LLMs acting as moderators in focus group discussions that include human participants.
Human-computer Interaction, Intelligent Virtual Agent, Virtual Focus Group, Multi Agent Simulation journalyear: 2024 copyright: acmlicensed conference: ACM International Conference on Intelligent Virtual Agents; September 16–19, 2024; GLASGOW, United Kingdom booktitle: ACM International Conference on Intelligent Virtual Agents (IVA ’24), September 16–19, 2024, GLASGOW, United Kingdom doi: 10.1145/3652988.3673918 isbn: 979-8-4007-0625-7/24/09 ccs: Computing methodologies Multi-agent planning ccs: Human-centered computing User studies ccs: Human-centered computing Virtual reality
## 1. Introduction
In the domain of qualitative research, focus groups have emerged as a widely adopted methodology and are extensively employed in both industrial and academic contexts (Kitzinger, 1994, 1995; Mazza, 2006), thanks to its structured group discussions aimed at gaining in-depth insights into specific issues. Within Human-Computer Interaction (HCI), researchers routinely employ focus groups as a vital tool in project planning, evaluation, and data collection endeavours (Mazza, 2006; Troshani et al., 2021; Selter et al., 2023; Stalmeijer et al., 2014). Particularly noteworthy is the growing prominence of virtual focus groups, especially in the post-COVID-19 era (Keen et al., 2022). This transition towards virtual focus groups can be attributed to their blending a methodologically sound approach with the potential of engaging with geographically dispersed and otherwise challenging to access populations (Turney and Pocknee, 2005).
Organising a focus group presents two primary challenges: first, gathering so many people at the same time is not an easy task, especially when researchers are interested in exploring the lived experiences of diverse or hard to reach groups (Brüggen and Willems, 2009; Gratton and O’Donnell, 2011; Wirtz et al., 2019); second, the success of a focus group relies on an experienced moderator with domain-specific expertise. A moderator lacking experience can disrupt the discussion flow or gather unproductive data (Nagle and Williams, 2013). These issues have sometimes hindered the adoption of focus groups into certain HCI research efforts (Rosenbaum et al., 2002).
<details>
<summary>extracted/5830562/FocusGroupSimulation.png Details</summary>

### Visual Description
\n
## Diagram: Facilitated Discussion Cycle
### Overview
The image depicts a cyclical diagram illustrating a facilitated discussion process. The cycle consists of four stages: Planning, Questions, Discussion, and Reflection. A robot "Moderator" is central to the process, initiating the Planning stage and guiding the flow. The diagram uses arrows to indicate the direction of the cycle.
### Components/Axes
The diagram consists of the following components:
* **Stages:** Planning, Questions, Discussion, Reflection.
* **Central Figure:** A robot labeled "Moderator".
* **Participants:** A group of cartoon figures seated around a table.
* **Arrows:** Indicate the flow between stages.
* **Labels:** "Planning" (red), "Questions" (grey), "Discussion" (blue), "Reflection" (red-dashed).
### Detailed Analysis or Content Details
The diagram shows a circular flow.
1. **Planning (Top):** A robot labeled "Moderator" stands within a red circular shape labeled "Planning". Red lines radiate outwards from the circle, suggesting outward projection or dissemination of plans.
2. **Questions (Right):** A grey arrow points from the "Planning" stage to a group of speech bubbles, representing "Questions".
3. **Discussion (Bottom):** A blue circular shape labeled "Discussion" contains a group of cartoon figures seated around a table. The figures appear to be engaged in conversation.
4. **Reflection (Left):** A red dashed arrow points from the "Discussion" stage back to the "Planning" stage, labeled "Reflection". This suggests a feedback loop where insights from the discussion inform future planning.
The "Moderator" robot is holding a clipboard and a pen, indicating a role of recording and managing information. The participants in the "Discussion" stage are diverse in appearance, suggesting inclusivity.
### Key Observations
The diagram emphasizes the iterative nature of the discussion process. The "Reflection" stage is visually represented as a dashed line, potentially indicating that it is a less formal or more nuanced step than the others. The "Moderator" plays a central role in initiating and guiding the process.
### Interpretation
This diagram illustrates a structured approach to facilitated discussions, likely within a team or organizational context. The cycle suggests a continuous improvement model where planning leads to questions, questions lead to discussion, and discussion leads to reflection, which then informs future planning. The presence of a "Moderator" highlights the importance of a neutral facilitator in guiding the process and ensuring productive dialogue. The use of cartoon figures and bright colors suggests a positive and collaborative environment. The dashed line for "Reflection" could indicate that this stage requires more critical thinking and analysis, and is not simply a mechanical step in the process. The diagram is a visual representation of a process, not a presentation of data. It is a conceptual model for how to conduct a meeting or brainstorming session.
</details>
Figure 1. The AI moderator generates questions according to the discussion content and plan, while AI Participants discuss the prompt from the moderator.
The advent of Large Language Models (LLMs), such as ChatGPT, offers a potential solution. These models can frequently communicate in text, generate diverse content from various perspectives based on the large scale of text information on the internet (Reynolds and McDonell, 2021; Brown et al., 2020), and demonstrate expertise across several fields, including social sciences, healthcare, and education (Koubaa, 2023; Sallam, 2023). Their capabilities extend to assisting with paper writing (Katar et al., 2023; Ciaccio, 2023), providing legal advice (Katz et al., 2024; Nay et al., 2024), and supporting medical inquiries (Haupt and Marks, 2023). Given these advancements, focus groups, a classic qualitative data collection method, should benefit from LLMs. Despite their potential, these models are prone to certain limitations such as misunderstanding human instructions, generating potentially biased content, or factually incorrect (hallucinated) information (Wang et al., 2023b). Additional framework design is still necessary for multi-agent tasks, such as societal simulations (Park et al., 2023) or role-playing game simulations (Xu et al., 2023).
This work introduces the “Focus Agent”, an LLM-based moderator for focus groups that has two functions: 1) simulating discussions without human participants and collecting AI-generated opinions, and 2) guiding focus groups as a moderator as shown in Figure 1, with human participants as well. To address prevalent issues in multi-agent simulations, including repetitive opinions and the generation of irrelevant content, the “Focus Agent” employs a scheduled discussion format that divides the focus group into distinct stages, each corresponding to a specific topic. This method mirrors the strategies employed by experienced human moderators. Additionally, the framework incorporates reflection periods during discussion to counteract memory loss during the simulation, ensuring a coherent and productive discussion flow. When moderating focus groups with human participants, a multi-person Speech-to-Text (S2T) and Text-to-Speech (T2S) integration enables the “Focus Agent” to interact with multiple users simultaneously.
Our work primarily explores the application of LLMs in simulating focus group discussions. Two main Research Questions (RQs) are as follows:
RQ 1: To what extent do the opinions generated by a LLM align with those of human participants in focus group?
RQ 2: To what extent is a LLM effective in performing the duties of a moderator in focus group discussions?
To answer these RQs, we conducted a user study with 23 participants across five discussion groups. Participants engaged in a one-hour AI-moderated focus group discussion on the topic of “digital well-being”, followed by a 30-minute session led by a researcher to share their experiences, evaluate the AI moderator’s performance and collect feedback, which was referred as a meta focus group in our work. Meanwhile, the Focus Agent simulated the focus group discussions on the same topic with AI participants. Qualitative analysis including thematic analysis and content analysis of the transcriptions reveals that the AI simulation outputs the majority of opinions expressed by human participants. Additionally, we assessed the performance of the Focus Agent functioning as a moderator, both in the focus group simulation with AI participants as well as with focus groups involving human participants. Based on our findings, the Focus Agent meets the essential criteria required of a focus group moderator. This includes progressively guiding discussions from general to more specific topics and maintaining an actively engaged atmosphere, drawing on the fundamental literacy expected of a focus group moderator (Stewart and Shamdasani, 2014). However, when tasked with moderating discussions involving human participants, the agent’s ability to interact with humans seems constrained, and it has not demonstrated sufficient understanding of human conversation. We identified several limitations of current LLMs in managing multi-person discussions and offer suggestions for integrating AI agents into focus group more effectively. To promote further research, the code has been open-sourced https://github.com/AriaXR/FocusAgent.
## 2. Related Work
This section discusses previous research directly related to our study. We divided it into three subsections: Focus Group Development, Multi-Agent Simulation and Multi-speaker speech recognition for Voice-based Conversational Agents.
### 2.1. Focus Group Development
The utilisation of focus groups, or group depth interviews, is a cornerstone method within the realms of advertising, marketing, and HCI research due to its effectiveness in gathering qualitative insights (Stewart and Shamdasani, 2014). The earliest focus groups were conducted through face-to-face conversations, which make the organisation complex and time-consuming, even with a lot of fees for participant reimbursement (Rosenbaum et al., 2002). The popularity of online focus groups has augmented their appeal, offering advantages such as the convenience of participation from any location at any time, and anonymity, which reduces participants’ apprehension of judgement (Daniels et al., 2019; Wilkerson et al., 2014; Stewart and Shamdasani, 2017). Researchers inviting many people to participate in online meetings at the same time often encounter difficulties, such as inconsistent time schedules, time differences, and poor communication caused by network delays. To further facilitate users’ participation in focus groups, some social media platforms provide asynchronous text-based focus groups (Gordon et al., 2021; Biedermann, 2018; Richard et al., 2021; Wenzek et al., 2019). However, as participants do not contribute simultaneously, it brings some difficulties relating to such a reduced ‘spontaneity’ (Brüggen and Willems, 2009; Nicholas et al., 2010) including: shorter answers with fewer word counts (Chen and Neo, 2019); uneven flow during the interactions due to their lag (Veloso, 2020); and more unfocused exchanges that do not always address the relevant research question (Brüggen and Willems, 2009).
The recent advancements in LLMs, which are trained on extensive internet text data, offer novel opportunities for conducting focus groups. As an innovative retrieval model, LLMs have the potential to streamline the data collection process (Zhu et al., 2023). Utilising LLMs to simulate focus groups presents a simpler and potentially more efficient alternative to engaging human participants, thereby opening new avenues for qualitative research.
### 2.2. Multi-Agent Simulation
Despite the capability of LLMs to process one-on-one question-answer formats, their deployment in long term dialogues and opinion generation, such as focus group discussions, reveals some limitations. These challenges include difficulties in understanding complex instructions, hallucination of agents, a limited token memory leading to loss of continuity, repetitive dialogues, and the generation of meaningless conversation in long-term interactions (OpenAI, 2023; Xu et al., 2024).
To help solve these issues, recent research has come up with new ways to organise how these AI agents think and respond, tailored to specific kinds of tasks (Talebirad and Nadiri, 2023; Park et al., 2023). The Chain-of-Thought (CoT) principle is pivotal, serving as the foundational idea behind them (Wang et al., 2023a). By dissecting complex issues into simpler elements, it facilitates a collaborative approach among multiple agents to tackle each component, leading to a comprehensive solution. By decomposing complex problems into many simple parts, the solution is achieved through the combined efforts of multiple small agents. Additionally, the reflection mechanism plays a crucial role in addressing memory limitations and enhancing the authenticity of the generated content (Yan et al., 2024). This process involves storing detailed historical data as structured information, which can be referenced for more informed decision-making in future interactions. Moreover, to improve the consistency of agent performance across various contexts, some works have investigated the exploration of diverse prompting techniques tailored to the specific roles (Shanahan et al., 2023).
In our work, we have built upon insights from previous research to address potential challenges that could arise during focus group discussions. Furthermore, we have developed a novel framework for conducting focus groups, primarily guided by an AI moderator. The AI moderator facilitates simulated focus group discussions and aids in coordinating focus groups that include human participants. To bridge the interaction gap with human participants, we incorporate a voice-based conversational agent to the moderator.
### 2.3. Multi-speaker speech recognition for Voice-based Conversational Agents
Unlike text-based chatbots, Voice-based Conversational Agents (VCAs) necessitate an extra technological layer for operation: they use a speech-to-text (S2T) process to interpret spoken inputs and a text-to-speech (T2S) system for generating spoken responses (Jokinen and McTear, 2022; Rough and Cowan, 2020). This integration allows VCAs to facilitate interactions in a more natural, conversational manner, bridging the gap between human users and digital assistants.
However, current S2T technologies, such as Google’s API or OpenAI’s Whisper, encounter difficulties in long-term group discussions such as focus groups (Radford et al., 2022). One challenge with using S2T technologies like Whisper for multi-participant discussions is the duration limit on voice recording inputs, which is considerably less than the typical length of conversations. A potential solution involves segmenting longer discussions into shorter fragments using Voice Activity Detection (VAD), which helps manage recordings more effectively (Bain et al., 2023). Another limitation is lack of speaker differentiation, a critical feature for understanding who is speaking in group discussions. Some research has attempted to identify individual speakers by analysing the unique timbre of their voices (Medennikov et al., 2020; Horiguchi et al., 2021, 2020). However, these methods often fall short in accuracy due to the absence of prior information about the speakers. A more effective approach involves using a pre-recorded sample from each speaker, enabling a retrieval-based method to significantly improve performance by accurately distinguishing between speakers (Desplanques et al., 2020).
In our work, we improved Whisper, an open-source S2T model, with a retrieval-based technique, optimising it for multi-participant discussions such as focus groups.
## 3. Focus Agent Implementation
Our Focus Agent was designed to simulate focus group discussions and facilitate running sessions involving human participants. For the focus group simulation, we devised a multi-agent framework, complemented by a moderator to oversee the entire focus group process. This ensures that the contributions from AI participants are both relevant and valuable. Regarding interactions with actual human participants, we incorporated S2T and T2S systems into the AI moderator, enabling voice-based communication.
### 3.1. Focus Group Simulation
<details>
<summary>extracted/5830562/AISimulation.png Details</summary>

### Visual Description
\n
## Meeting Documentation: Focus Group Transcript
### Overview
The image presents a transcript of a focus group discussion centered around "Digital Wellbeing: Balancing Screen Time and Mental Health." The transcript captures the dialogue between participants named Noah, Able, William, and David, facilitated by a moderator (presumably Able). The setting is a "Meeting Room" with a scheduled time of 60 minutes. A visual element depicts a group of people around a table, likely representing the focus group participants.
### Components/Axes
The image consists of:
* **Header:** "Meeting Room" title, "Focus Group Topic: Digital Wellbeing: Balancing Screen Time and Mental Health", "Scheduled Time: 60 minutes", "Simulation Time: 11 minutes".
* **Visual Element:** A cartoon-style illustration of four people seated around a table. The people are labeled "Noah", "Able", "William", and "David". There are also icons representing "Ideas", "Moderator", and "Planning".
* **Transcript:** A block of text containing the dialogue from the focus group participants.
### Detailed Analysis or Content Details
The transcript is as follows:
**Noah:** "On average, I spend about 4-5 hours on screens each day. My online activities vary but I’m particularly drawn to exploring alternative viewpoints and engaging in intellectual discussions on various platforms. While this has broadened my horizons and allowed me to connect with like-minded individuals, I’ve also noticed the potential for information overload and the negative impact it can have on my mental clarity. It’s a constant reminder to balance my online pursuits with offline experiences that rejuvenate my mind and soul."
**Able:** "Now that we’ve discussed our overall perspectives on digital wellbeing, let’s dive into our digital habits. How much time do you typically spend on screens each day, including smartphones, tablets, and TVs? And what are the activities you engage in most frequently online, such as social media, gaming, or streaming? Please share your experiences and any positive or negative impacts you’ve noticed from your digital habits."
**Noah:** "Personally, I’ve found that incorporating regular digital detoxes into my routine has been incredibly beneficial. One weekend a month, I completely disconnect from all screens and immerse myself in nature, engaging in activities that ground me and allow me to reconnect with myself. This intentional break from technology helps me recharge and gain a fresh perspective, reminding me of the importance of being present in the physical world. It’s a liberating experience that I would highly recommend to anyone seeking to strike a healthy balance between the digital realm and their mental wellbeing."
**Able:** "Is everyone to share about your experiences and perspectives on digital wellbeing. Building on your discussion so far, is it curious to know if any of you have tried any specific techniques or tools to limit screen time and improve your mental health?"
**William:** "Greetings from William in Malaysia. Coming from work as an Event Planner. Technology has revolutionized the event planning industry, making tasks more efficient and communication easier. However, I also believe in maintaining a healthy balance. What I find important is adapting and being flexible to embrace new technologies, while being also mindful of the impact it can have on our mental wellbeing. Let’s explore strategies that allow us to harness the benefits of technology without compromising our overall wellbeing."
**Able:** "It’s great to have such diverse perspectives in our group. Now, let’s hear from William. Would you like to introduce yourself and share your relationship with technology?"
**Noah:** "As a free-spirited and independent thinker, I’ve always valued personal freedom and individuality. When it comes to screen time and mental wellbeing, I believe it’s essential to set not only boundaries but also question the societal norms around technology. Sometimes going against the norm and following our own path can lead to a more fulfilling and balanced life. So, let’s explore unconventional strategies and challenge the status quo to pursue digital wellbeing."
**David:** "I completely agree with David. In my experience, there’s also a soothing habit to be found in our daily activities which can be helpful. For example, I like to allocate about an hour of time for professional work, then connect to social media for about 10 minutes. I can maintain a sense of control over my screen time and it’s a way to stay connected with friends and family."
### Key Observations
* The discussion revolves around the balance between the benefits and drawbacks of digital technology on mental wellbeing.
* Participants share personal experiences with screen time, digital detoxes, and strategies for maintaining a healthy relationship with technology.
* There's a recurring theme of intentionality and mindful usage of technology.
* The participants come from diverse backgrounds (Event Planner, Independent Thinker) and locations (Malaysia).
### Interpretation
The focus group transcript reveals a nuanced understanding of the impact of digital technology on mental health. Participants acknowledge the benefits of connectivity and access to information but also recognize the potential for overload and negative consequences. The emphasis on digital detoxes, mindful usage, and setting boundaries suggests a growing awareness of the need for a balanced approach. The diverse perspectives highlight the subjective nature of digital wellbeing and the importance of finding strategies that work for individual needs and lifestyles. The transcript demonstrates a desire to move beyond simply limiting screen time and towards a more holistic approach that integrates technology into a fulfilling and balanced life. The visual element reinforces the collaborative nature of the discussion, depicting participants engaged in a shared exploration of these complex issues.
</details>
Figure 2. A web demo of the Focus Group simulation system.
In accordance with the benchmark study conducted by OpenCompass (Contributors, 2023), the two most advanced Large Language Models (LLMs) available in the field at the time of writing are ChatGPT and GPT4. Pilot testing revealed that ChatGPT resulted in similar opinions compared with GPT4, after which we decided not to use the superior GPT4 due to its 20-fold increase in cost. Compared to direct prompts, our algorithmic framework improves the realism and comprehensiveness of the AI simulation, as corroborated in Figure 2.
Initially, we attempted to employ a singular prompt to simulate focus group discussions. However, concerning both content and length, the generated outcomes significantly deviated from our expectations. In response to these challenges, we introduce the framework of our Focus Agent, featuring an AI moderator to guide the discussion process. As shown in Figure 1, this AI moderator generates some plans to divide the whole discussion into multiple stages, aligning with the distinct topic and aims of the focus group. Based on these guidelines, the AI moderator then facilitates a simulated focus group discussion with other AI entities as participants. Throughout the conversation, the moderator actively engages in reflection, responding to the dialogue of the participants by timely introducing pertinent questions to foster further discussion. We explained this process in detail in the online appendix.
Within the simulated focus group, each participant represents an artificial intelligence entity. Experimenters are responsible for defining key parameters such as the topic, goals, overall duration, and specific characteristics of the participants, which include names, ages, occupations, nationalities, and personalities. In this setting, LLMs are tasked with understanding the context through assigned roles, typically categorised as system, user, and assistant. The system role involves attributing virtual personas to the LLMs, while the user and assistant roles are designed to aid in interpreting the context either from the viewpoint of the designated character or from that of others. To achieve this, we have developed a sequence of prompt designs, the details of which are provided in the online appendix.
To simulate the focus group discussion as realistically as possible, we designed the algorithm of both moderator and participants. The role of the moderator within the focus group simulation system encompasses the critical responsibilities of guiding and orchestrating the discussion, which includes managing time allocation and steering the discourse topics. These responsibilities are reflected in the moderator’s thought chain, elucidated in Algorithm 1. We added a reflection mechanism at the end of every stage to compress the context of previous discussion to avoid memory lost. Time allocation is managed based on text lengths, with a convention of one hundred words equating to approximately one minute within the simulation.
Algorithm 1 Moderator
0: $List:[Stages],List:[TimeArrangements]$
0: $Str:Response$
for all $stage,time_{stage}\leftarrow Stages,TimeArrangements$ do
$Response\leftarrow LLM(NewStagePrompt)$
$time_{cur}\leftarrow Estimate(Response)$
while $time_{cur}<time_{stage}$ do
if Response from participants then
$Response\leftarrow ParticipantResponse$
else if any participant is inactivate then
$Response\leftarrow LLM(InactivateParticipantPrompt)$
else
$Response\leftarrow LLM(InsightsPrompt)$
end if
Update $time_{cur}$ according to $Estimate(Response)$
end while
end for
Algorithm 2 outlines the systematic approach adopted by each AI participant throughout the discussion, with their level of engagement assessed by the LLM. The LLM dynamically evaluates the ongoing conversation and the contributions of other AI participants to gauge engagement levels. AI participants are provided the latitude to contribute to the discussion uninterrupted unless they surpass the stipulated time allocation. In instances where participants opt to disengage or exhibit novel ideas, signalling a lull in the discourse, the moderator intervenes by posing new questions, drawing inspiration from the preceding discussions. In parallel, the moderator actively encourages less active participants to actively partake in the discourse. Participant activity is monitored through the detection of speaking times within the ongoing stage. Participants who exhibit negligible speaking activity or speaking three times less than those of the most speaking participants are categorised as inactive.
Algorithm 2 Participants
0: $List:[Participants]$
0: $Str:response$
repeat
$engagements\leftarrow[]$
for all $participant$ in $Participants$ do
$engagements$ add $LLM(EngagementPrompt)$
end for
if $Max(engagements)\geq Threshold$ then
$speaker\leftarrow Participants[Index(Max(engagements))]$
$Response\leftarrow LLM(PartResponsePrompt(speaker))$
end if
until Finished
### 3.2. Voice-based Focus Agent with human participants
To make sure the AI moderator can communicate with human participants efficiently, S2T and T2S are necessary. APIs provided by various companies are often suitable for many scenarios. However, they fall short of our specific needs for facilitating multi-participant discussions in focus groups due to limitations related to the length of input recordings and the absence of speaker differentiation. To address these challenges, we have developed our own S2T system, as depicted in Figure 3. This system processes long discussion audio by segmenting it into shorter sentence-length audio pieces, leveraging VAD for segmentation. Subsequently, it identifies the most similar participant from a database of participant voices and transcribes the audio using the open-source S2T model Whisper by OpenAI (Radford et al., 2022). To ensure participants have ample opportunity to express their views without undue interruption, the AI moderator is programmed to intervene only after a silence of 5 seconds, thus differing from approaches that might actively disrupt the conversation flow.
<details>
<summary>extracted/5830562/S2T.png Details</summary>

### Visual Description
\n
## Diagram: Voice Processing Pipeline
### Overview
This diagram illustrates a voice processing pipeline, starting with audio input and culminating in transcripts generated by a Large Language Model (LLM). The pipeline includes stages for voice activation detection, speech recognition, speaker retrieval, and integration with a voiceprint database.
### Components/Axes
The diagram consists of the following components:
* **Input Audio:** Represented by a microphone icon.
* **Voice Activate Detection:** A decision block checking for silence. "N" indicates a "No" response (voice detected).
* **Speech Recognition:** Converts audio waveforms into text ("Sentence 1", "Sentence 2", "Sentence 3").
* **Voiceprint Database:** Contains "Participants Voiceprints".
* **Speaker Retrieval:** Identifies the speaker from the voiceprint database.
* **LLM:** A Large Language Model, the final processing stage.
* **Transcripts:** Output text associated with each participant ("Participant 1 Sentence 1", "Participant 2 Sentence 2", "Participant 1 Sentence 3").
* **Time Delay:** A horizontal arrow labeled "5 seconds" connecting the Voice Activate Detection to the LLM.
* **Waveforms:** Represent audio signals at various stages.
### Detailed Analysis or Content Details
The pipeline operates as follows:
1. **Input Audio** is received.
2. **Voice Activate Detection** checks for silence. If no silence is detected ("N"), the process continues.
3. The audio is passed to **Speech Recognition**, which converts it into text. Three example sentences are shown: "Sentence 1", "Sentence 2", and "Sentence 3".
4. The audio waveforms are also sent to **Speaker Retrieval**.
5. **Speaker Retrieval** accesses the **Voiceprint Database** to identify the speaker.
6. The identified speaker and the corresponding transcripts are then fed into the **LLM**.
7. The **LLM** generates final transcripts, labeled with the participant and sentence number: "Participant 1 Sentence 1", "Participant 2 Sentence 2", "Participant 1 Sentence 3".
8. There is a 5-second delay between voice activation and input to the LLM.
### Key Observations
* The diagram highlights a closed-loop system where audio is processed through multiple stages to achieve accurate transcription and speaker identification.
* The inclusion of a voice activation detection stage suggests a focus on minimizing processing of silent periods.
* The 5-second delay indicates a potential latency in the system.
* The diagram shows a simplified representation of the process, likely omitting details such as error handling or noise reduction.
### Interpretation
This diagram depicts a typical architecture for a voice-based system, such as a voice assistant or a transcription service. The pipeline demonstrates the integration of several key technologies: voice activity detection, automatic speech recognition (ASR), speaker diarization (through voiceprint matching), and natural language processing (NLP) via the LLM. The 5-second delay could be due to the time required for speech recognition, speaker retrieval, and LLM processing. The system aims to convert spoken audio into structured text data, associating each sentence with the correct speaker. The use of a voiceprint database suggests a focus on identifying known speakers, potentially for security or personalization purposes. The diagram suggests a system designed for multi-participant conversations, as evidenced by the multiple transcripts associated with different participants.
</details>
Figure 3. Speech to Text system. We divided long audio recording into short pieces with voice activity detection. Then we transcribed the short audio pieces and recognised the speaker according to the voiceprints collected in advance from the participants.
In order to incorporate T2S functionality into our system, we leveraged the Google TTS API https://console.cloud.google.com/speech/text-to-speech. To allow some participants who are interested in discussion in an immersive environment, we established the focus group environment within Mozilla Hubs https://hubs.mozilla.com, a Virtual Reality (VR) platform.
## 4. Pilot Study
To enhance user experience, we conducted a pilot study with four volunteers before the main user study to assess the system’s stability and the AI moderator’s effectiveness. The pilot included a 50-minute focus group discussion and a 30-minute feedback session on the AI agent’s performance.
Feedback from the pilot highlighted areas for improvement, which were addressed to optimise the user study:
1. Human participants may not always have insights for every query, unlike AI participants who consistently generate new content. Observations showed the AI moderator might repeat questions if there were no responses, leading to stagnation. We adjusted the AI moderator’s protocol to move on if no further responses were forthcoming.
1. Anonymity in Summaries: Volunteers were uncomfortable with being mentioned by name in summaries. We revised the process to ensure participant anonymity, enhancing comfort levels.
1. Conciseness of Questions: The long content generated by LLMs are not ideal for verbal interactions. We refined prompts to yield shorter responses.
Additionally, we assessed the S2T system’s accuracy to ensure comprehensive transcription and understanding by the agent. The Word Error Rate (WER) WER is a metric for gauging speech-to-text conversion accuracy, calculated as $WER=(S+D+I)/N$ , where $S$ denotes substitutions, $D$ deletions, $I$ insertions, and $N$ the total number of words in the reference text. served as the evaluation metric. Professional human transcribers typically achieve a WER of 11.3% in open conversational settings (Xiong et al., 2016). Using this as a benchmark, we found our S2T system achieved a WER of 4.6%, demonstrating commendable accuracy. For speaker identification, our system achieved a micro F1 Score of 0.81 using the EN2001 audio segment from the AMI Corpus (Carletta, 2006), highlighting its capability in recognising speakers. The pilot study indicated the agent exhibited no significant misunderstandings of the conversations.
## 5. User Study
To investigate our research questions, we designed a user study that involved human participants engaging in focus group discussions on the theme of “digital well-being,” alongside simulations of focus groups centred around the same topic. The objective of these sessions was to study individual practices in managing screen time and their perceptions of its impact on mental health. The choice of “digital well-being” as the focal topic was strategic, given its universal relevance, which facilitated participant recruitment. Participants had the option to join the focus groups either via a VR headset or through their personal computers, aiming for device consistency within groups to streamline the discussion dynamics, as shown in Figure 4.
Demographics. Our recruitment efforts yielded 23 participants, where we assigned 11 to join with VR headset and 12 to join with their own personal computer. The participant pool had an average age of 30 years ( $min=18$ , $max=60$ , $SD=10$ ), distributed across five groups–three with VR headset and two with desktop. Each group comprised 3 to 6 individuals, ensuring a diverse range of perspectives and experiences. The selection of the total number of groups is based on previous work (Guest et al., 2017), which has demonstrated that five groups are optimal for focus group studies.
Procedure. The user study included three distinct components: a primary focus group involving human participants (hereafter referred to as “ focus group ”), a meta focus group where human participants convened to reflect on their experiences within the focus group (hereafter referred to as “ meta focus group ”), and a simulated focus group with AI entities as participants (hereafter referred to as “ focus group simulation ”).
First, participants submitted a one-minute self-introduction audio recording before the focus group. This recording collected demographic information (age, prior focus group experience, and daily screen usage) and provided a unique voice print for each participant. This data initialised the AI participants in the simulation. We assessed English proficiency based on the accuracy of the S2T results from their recordings. Then participants accessed the designated meeting rooms in Mozilla Hubs. For VR groups, our team provided VR headsets (Quest series or Vive Pro), while the desktop group used their own PCs. Once all participants were ready, the researcher started the system, and the AI moderator began moderating the focus group. An author observed and recorded essential information throughout the sessions. The sessions were scheduled for 60 minutes, with an actual average duration of 51 minutes ( $SD=13minutes$ ).
Following the conclusion of each focus group discussion, a meta focus group was conducted. This session spanned approximately 20 minutes and was facilitated by one of the authors. The topic of the meta focus group mainly focuses on two points: the experience of focus group discussion and the attitude to the AI moderator.
At the end, each participant received a 10€ gift card as compensation. This study was reviewed and approved by the university’s ethics review board.
<details>
<summary>extracted/5830562/FocusAgent.png Details</summary>

### Visual Description
\n
## Screenshot: Virtual Meeting Room
### Overview
The image is a screenshot of a virtual environment, resembling a conference or meeting room. The perspective is first-person, looking towards the back of the room. Several virtual avatars are present, and the room is furnished with tables, chairs, and windows. The overall aesthetic is blocky and low-resolution, suggesting a game or virtual world platform like Roblox.
### Components/Axes
There are no axes or charts present in this image. The key elements are:
* **"EXIT" sign:** Located in the top-left corner, colored green and white.
* **"Moderator" label:** Floating above one of the avatars, positioned in the mid-left of the image.
* **Avatars:** Several distinct avatars are visible, each with unique appearances.
* **Furniture:** Tables and chairs are arranged in the room.
* **Windows:** Three windows are visible along the back wall, letting in light.
* **Lighting:** Strong directional lighting creates shadows across the floor.
### Detailed Analysis or Content Details
The image does not contain numerical data or a structured table. The following details can be observed:
* **Avatar 1 (Closest):** The avatar closest to the viewer is facing away, showing a black office chair.
* **Avatar 2 (Center):** An avatar with a large, spherical, reddish-brown head is visible. The head appears to have a textured pattern.
* **Avatar 3 (Left):** An avatar with a purple and blue body and a small head is present. It appears to be standing near a table with objects on it.
* **Objects on Table:** The table near Avatar 3 has several small, colorful objects on it. These appear to be virtual items or props.
* **Room Structure:** The room has a brown ceiling with recessed lighting. The walls are a light green color. The floor is a light-colored material with strong shadows cast upon it.
* **Window View:** The windows show a cloudy sky.
### Key Observations
* The "Moderator" label suggests a hierarchical structure within the virtual environment.
* The variety of avatar appearances indicates a customizable platform.
* The presence of objects on the table suggests interactive elements within the virtual space.
* The lighting and shadows create a sense of depth and realism, despite the blocky graphics.
### Interpretation
This image depicts a virtual meeting or social space. The presence of a "Moderator" suggests a structured event or discussion is taking place. The avatars represent users interacting within the environment. The room's design and the graphical style indicate a platform geared towards younger audiences or casual social interaction. The image doesn't provide any specific data or trends, but it offers a snapshot of a virtual social environment and the dynamics within it. The scene suggests a collaborative or educational setting, given the presence of a moderator and the arrangement of furniture. The objects on the table could be tools or resources used during the meeting. The overall impression is one of a playful and engaging virtual experience.
</details>
Figure 4. Users participant focus group using Focus Agent in VR environment.
## 6. Result Analysis
Following the methodological framework proposed by Gerling et al. (Gerling et al., 2020), we employed both thematic and content analyses to scrutinise the transcripts derived from the focus group and focus group simulation sessions. Additionally, thematic analysis was specifically applied to the meta focus group discussions to collect participant feedback. For the transcription of data from the user study, we utilised the outputs from our S2T system, subsequently refining these transcripts against the recorded audio by two researchers. The final evaluation of our S2T system showcased a WER of 2.5% and an F1 score of 0.9, indicating a level of performance sufficiently reliable for the purposes of our study. Due to recording issues, the data from the third focus group session was incomplete. The transcription for this group was reconstructed based on recollections and notes taken by an observer, and consequently, this data was not included in the accuracy assessment of the S2T system. The initial analysis was conducted by the lead author, with the findings subsequently reviewed and validated by the co-authors.
### 6.1. Focus Group
In the thematic analysis conducted on the transcriptions from both the human focus group and the focus group simulations, we elicited distinct themes related to our study topic. From the transcriptions, we identified four central themes. In contrast, the focus group simulations revealed five themes, incorporating an additional theme focused on the challenges associated with controlling screen time. This discrepancy mainly came from the differences in moderation performance between the two groups. In the focus group simulations, the AI moderator tends to guide AI participants to engage more deeply with the topics. While human participants in the focus group did broach additional topics, these were less related to the central theme of discussion, highlighting a contrast in how thematic expansion was handled across the two settings.
<details>
<summary>extracted/5830562/Content_Analysis.png Details</summary>

### Visual Description
\n
## Network Diagram: Daily Screen Usage, Impact, Strategies & Effects
### Overview
This image presents a network diagram illustrating daily screen usage, the impact of long screen time, strategies to balance screen time, and the effects after achieving balance. The diagram utilizes interconnected nodes and color-coding to represent different categories and their relationships. The diagram is divided into four main sections arranged in a 2x2 grid.
### Components/Axes
The diagram consists of four main sections:
1. **Daily Screen Usage:** A network of activities with node sizes representing usage frequency.
2. **Impact of Long Screen Time:** A network of negative consequences, also with node sizes indicating impact severity.
3. **Strategies to Balance:** A network of strategies to mitigate screen time, with node sizes representing implementation frequency.
4. **Effect After Balance:** A network of positive effects after implementing balance strategies, with node sizes indicating effect strength.
A legend in the top-right corner defines the color-coding:
* Green: AI
* Red: Human
* Pink: Both
### Detailed Analysis or Content Details
**1. Daily Screen Usage (Top-Left)**
This section displays a network of activities. Node sizes are proportional to usage frequency, indicated by numbers within the nodes.
* Navigate: 3
* Take Photos: 3
* Shop: 3
* Social Media: 8
* Update News: 1
* Play Games: 8
* Keep Relationships: 8
* Read E-book: 8
* Study/Work: 8
* Watch Videos: 8
* Volunteer: 1
**2. Impact of Long Screen Time (Top-Right)**
This section displays a network of negative impacts. Node sizes are proportional to the severity of the impact, indicated by numbers within the nodes.
* Comparison: 1
* FOMO: 4
* Eye Strain: 4
* Insecured: 4
* Stressed: 13
* Isolated: 1
* Overwhelmed: 1
* Self-doubt: 1
* Anxiety or depression due to negative content from social media: 13
* Increased Sense of Belonging: 1
* Insomnia: 13
* Guilty: 4
* Impaired focus: 4
* Reduced ability for face-to-face interaction: 4
* Limited creativity: 4
**3. Strategies to Balance (Bottom-Left)**
This section displays a network of strategies. Node sizes are proportional to the frequency of implementation, indicated by numbers within the nodes.
* Setting Boundaries: 0
* Digital Declutter: 0
* Engage in Offline Hobbies or Activities: 11
* Regular Exercise: 11
* Limit Screen Time with Apps: 11
* Schedule a Fulfilling Life: 5
* Self Reflection: 5
* Technology-Free Zones: 5
* Create a support system with friends: 5
* Practicing Mindfulness: 5
* Set 'Screen-Free' days: 5
* Reminders to Step Away From Screen: 0
* Setting Goals for Online Activities: 0
* Screen Blocking: 0
**4. Effect After Balance (Bottom-Right)**
This section displays a network of positive effects. Node sizes are proportional to the strength of the effect, indicated by numbers within the nodes.
* Anxiety: 2
* Stressful: 1
* Isolation: 1
* Presence: 3
* Fulfilled: 3
* Relaxed: 3
### Key Observations
* **Screen Usage:** Social Media and Study/Work are the most frequent activities (8 each).
* **Negative Impacts:** Stressed, Anxiety/Depression, and Insomnia are the most significant negative impacts (13 each).
* **Balance Strategies:** Engaging in Offline Hobbies/Activities, Regular Exercise, and Limiting Screen Time with Apps are the most frequently implemented strategies (11 each).
* **Positive Effects:** Presence, Fulfilled, and Relaxed are the strongest positive effects (3 each).
* The color coding indicates that most activities and impacts are related to "Human" experience (red).
* Several strategies and effects have a value of 0 or 1, suggesting they are less common or impactful.
### Interpretation
The diagram illustrates a clear relationship between screen usage, negative impacts, balance strategies, and positive effects. High screen usage, particularly in areas like social media and work, is associated with significant negative impacts on mental and emotional well-being (stress, anxiety, insomnia). Implementing balance strategies, especially those involving offline activities, can lead to positive effects such as increased presence, fulfillment, and relaxation.
The large values associated with "Stressed," "Anxiety/Depression," and "Insomnia" suggest these are major concerns related to excessive screen time. The prominence of offline activities as balance strategies highlights the importance of disconnecting from digital devices and engaging in real-world experiences. The diagram suggests a cyclical relationship: increased screen time leads to negative impacts, which motivates the implementation of balance strategies, ultimately resulting in positive effects.
The color coding reinforces the idea that these impacts and strategies are primarily related to the human experience, rather than being driven by AI or a combination of both. The diagram serves as a visual representation of the need for mindful screen usage and proactive strategies to maintain a healthy balance between the digital and physical worlds.
</details>
Figure 5. Content analysis according to the themes from both focus group and focus group simulation, font size indicates the frequency of the codes
In our content analysis, we derived 39 unique codes from the focus group transcriptions and 47 from the focus group simulations, each reflecting various facets of the discussion topic. To compare the perspectives of AI and human participants, we illustrated the overlap and divergence of these codes through a Venn diagram, as showcased in Figure 5. The analysis revealed that the majority of opinions expressed by human participants were also covered by AI participants. Interestingly, AI participants introduced several viewpoints not raised by their human counterparts, such as volunteering online during daily screen usage, adding additional dimensions to the discussion. Another observation from this analysis is the tendency of AI participants to express similar opinions more than human participants across different focus group sessions. The data referenced in Figure 6 reveal a discrepancy in code generation between simulation and focus groups. Each iteration of the focus group can collect similar number of codes. The result indicates that simulations of focus groups tend to generate higher repetition of identical codes. Following several iterations, the aggregate of unique codes converges, suggesting that the most common opinions have been collected. At this point, AI participants can not generate new codes, whereas human participants continue to demonstrate potential for such creativity. This observation underscores the tendency of AI to produce more common opinions, while human participants display greater variance and individuality in their perspectives.
<details>
<summary>extracted/5830562/CodeNumberChangeWithGroup.png Details</summary>

### Visual Description
\n
## Bar Chart: Code Number Increase with Groups
### Overview
This is a bar chart illustrating the increase in the number of unique codes generated by humans and an AI across five different groups. The x-axis represents the group number (1 to 5), and the y-axis represents the number of unique codes. Two data series are presented: one for "Human" generated codes (blue bars) and one for "AI" generated codes (orange bars).
### Components/Axes
* **Title:** "Code Number Increase with Groups" (positioned at the top-center)
* **X-axis Label:** "Number of group" (positioned at the bottom-center)
* **X-axis Markers:** 1, 2, 3, 4, 5 (equally spaced along the x-axis)
* **Y-axis Label:** "Unique Codes" (positioned on the left-center)
* **Y-axis Scale:** 0 to 50, with increments of 5.
* **Legend:** Located at the bottom-center.
* **Human:** Represented by a blue bar.
* **AI:** Represented by an orange bar.
### Detailed Analysis
The chart consists of ten bars, five for each data series (Human and AI).
**Human (Blue Bars):**
* **Group 1:** Approximately 11 unique codes.
* **Group 2:** Approximately 24 unique codes. The bar rises significantly.
* **Group 3:** Approximately 34 unique codes. Continues to rise.
* **Group 4:** Approximately 36 unique codes. A slight increase.
* **Group 5:** Approximately 38 unique codes. A small increase.
**AI (Orange Bars):**
* **Group 1:** Approximately 37 unique codes.
* **Group 2:** Approximately 46 unique codes. The bar rises significantly.
* **Group 3:** Approximately 46 unique codes. Remains constant.
* **Group 4:** Approximately 46 unique codes. Remains constant.
* **Group 5:** Approximately 48 unique codes. A small increase.
The AI consistently generates a higher number of unique codes than humans across all groups. Both series show an increasing trend in unique codes as the group number increases, but the rate of increase slows down for both series in the later groups.
### Key Observations
* The AI consistently outperforms humans in generating unique codes.
* Both humans and the AI show an initial rapid increase in unique codes from Group 1 to Group 2, followed by a slower, more gradual increase in subsequent groups.
* The AI's code generation plateaus between Groups 2 and 4, remaining at approximately 46 unique codes.
* The difference in unique codes between humans and the AI is most pronounced in Group 1, and narrows as the group number increases.
### Interpretation
The data suggests that the AI is more effective at generating unique codes than humans, particularly in the initial groups. The initial rapid increase in code generation for both humans and the AI could be due to the introduction of new concepts or challenges in each group. The plateau in the AI's code generation from Groups 2 to 4 might indicate that the AI has reached a limit in its ability to generate novel codes within the given constraints, or that the task has become saturated. The continued, albeit slower, increase in code generation for both humans and the AI in the later groups suggests that they are still able to adapt and generate new codes, but at a diminishing rate. This could be due to the increasing difficulty of the task or the exhaustion of available coding possibilities. The chart demonstrates a clear difference in performance between human and AI code generation, with the AI consistently producing a higher number of unique codes.
</details>
Figure 6. Unique code number increased according to the round of focus group and focus group simulation.
### 6.2. Meta Focus Group
According to the transcriptions of the meta focus group, we coded 51 data points and identified three main themes.
We derived three themes from the data: 1) User Experiences of the Virtual Focus Group; 2) User Attitudes towards the Focus Agent, which is further divided into two sub-themes: a) Positive Attitudes, and b) Negative Attitudes; and 3) Feedback on the Virtual Focus Group System.
#### Theme 1: User Experiences of Focus Group.
A majority of participants conveyed satisfaction with the focus group discussions, highlighting several reasons. For many, the topics discussed were directly relevant to their daily lives, adding value to their participation. As one participant explained, “ I think it’s great to discuss these topics because that’s what we deal with every day. ” (G5, P4). Furthermore, participants appreciated the diversity of perspectives present, valuing the opportunity to exchange experiences. An exemplifying statement reads, “ I think you bring up so many great points. It’s very enriching to hear different perspectives. ” (G5, P3). At the end of the discussions, the moderator inquired whether participants had any additional opinions on the topic that they had not had the opportunity to express during the session. All participants confirmed that they had no further insights to share, indicating that the discussions had comprehensively covered the topic from their perspectives.
#### Theme 2: Attitude to the Focus Agent.
The second theme encapsulates the users’ feedback and experiences with Focus Agent. This theme is divided into two categories: positive and negative, to provide a clearer understanding of the users’ attitudes towards Focus Agent.
SubTheme 1: positive attitude. A prevalent sentiment among participants was their appreciation for the guidance offered by the Focus Agent, acknowledging its efficacy in steering the discussions. As an example, one participant remarked, “ The moderator kind of did a good job by posing questions that allowed us to express our thoughts and encouraged other participants to share their sentiments on the topic. ” (G4, P1). Furthermore, three participants specifically commended the Focus Agent’s clear articulation in English, while an additional participant admired the agent’s friendly demeanour.
SubTheme 2: Negative attitude. The prevailing sentiment among participants leaned towards dissatisfaction with the Focus Agent’s performance. A recurring concern revolved around the repetition of questions, as one participant articulated, “ I found it somewhat confusing at times since the moderator repeated the questions several times, which we had already discussed ” (G1, P2). Another noteworthy issue was the perceived lack of intellectual acumen exhibited by the Focus Agent during discussions. For example, one participant expressed, “ I don’t believe it possesses true intelligence, nor does it seem capable of comprehending all the information we’ve conveyed, let alone guiding us into more profound and coherent discussions ” (G5, P2). At last, some biases were identified in the discussion, notably in steering participants towards articulating the adverse effects associated with prolonged screen use, “ When discussing the impact of long screen using time, I felt that the AI moderator tried to demonise the technology. (G1, P1)”.
#### Theme 3: Feedback on virtual focus group system.
The third theme encapsulates certain system issues encountered during the use of Focus Agent. A concern raised by some participants was the insufficient time allocated for responding to questions, resulting in interruptions by the agent. As articulated by one participant, “ There were instances where we were attempting to respond to a question or had just commenced our response when the moderator interrupted us and swiftly moved on to the next question ” (G3, P3). Furthermore, two participants recommended the incorporation of subtitles to augment their understanding of the questions posed.
## 7. Discussion
In this discussion, we address the RQs through our findings and expand on the underlying reasons informed by our analysis.
### 7.1. RQ1: To what extent do the opinions generated by a LLM align with those of human participants in focus group?
The content analysis of the focus group discussions revealed that opinions generated by AI tend to encompass a wide array of human perspectives within the designated topic. Nevertheless, these AI-generated opinions often reflected more common viewpoints, demonstrating a lack of the uniqueness commonly found in human responses. A possible explanation is that, unlike human participants, who dynamically build upon previous contributions and enrich discussions with personal experiences, AI responses largely appeared as potentially plausible experiences that might happen to people.
This observation suggests that LLMs could serve as a tool for researchers aiming to streamline the focus group process with human participants. By deploying a Focus Agent, researchers could initially gather a broad spectrum of common opinions on a specific topic, thereby setting a foundational understanding of the expected participant responses. This could further assist in refining the focus group’s questions and topics, making the discussion more targeted and efficient. Therefore, fewer human focus group sessions may be required to confirm the AI-generated content and identify novel insights from participants, optimising the research process while still uncovering the unique, creative perspectives that only human participants can provide. However, human participants are still necessary for current focus groups to make sure the data is reliable.
### 7.2. RQ2: To what extent is a LLM effective in performing the duties of a moderator in focus group discussions?
During the focus group simulations, LLMs demonstrated sufficient knowledge to facilitate the group and engage with AI participants effectively. Feedback from the meta focus group indicated that human participants acknowledged the AI moderator’s capability to support the discussion, albeit perceiving it more as a tool rather than a sentient interlocutor. This perception was attributed to the AI moderator’s lack of apparent intelligence in interactions, such as overlooking participant requests, posing repetitive questions, or failing to grasp the hints behind conversations.
The primary challenges are rooted in the inherent limitations of LLMs in navigating multi-person dialogues. For LLMs to respond appropriately, they must comprehend inputs from human participants, reason through the conversation’s context, and formulate accurate responses. While their reasoning capacity seemed adequate during simulations, issues predominantly arose in understanding and response generation phases. Existing research has begun to address LLMs’ comprehension issues when assisting humans, yet their effectiveness in multi-participant discussions remains constrained (Dong et al., 2023). From an understanding standpoint, discussions among human participants often involve colloquial language and incomplete sentences, differing markedly from the more structured exchanges with AI, leading to the AI moderator’s difficulties in recognising whether questions had been answered. Consequently, the AI moderator might repetitively address the same points rather than progressing the discussion. Additionally, challenges in generating aligned and unbiased content persist within LLM outputs (Wang et al., 2023b; Taubenfeld et al., 2024). Although not the primary focus of our study, participants noted issues such as deviation from guidelines or biases in the discussion (see subsection 6.2), underscoring the LLMs’ limitations in mimicking human conversational norms accurately.
Given these observations, we advise against deploying the Focus Agent as the sole moderator in focus group discussions due to the current inadequacies in human-AI communication. Instead, the AI-generated summaries and questions could be utilised by human moderators to streamline the discussion flow and address specific topics. For more in-depth discussions, the presence of a human moderator is essential to ensure a positive user experience and foster the generation of innovative insights, highlighting the complementary roles of AI and human moderators in enhancing the efficacy of focus groups.
### 7.3. Improvement of Focus Agent
Based on the process of the user study and the insights gathered during the meta focus group, several areas for enhancing the structure and functionality of the Focus Agent have been identified:
1. Design of thought chain: Although the thought chain in our work shows enough ability to facilitate the focus group discussion, to be able to facilitate a deeper topic during discussion a more complex design is required, for example one such as the tree of thoughts (Yao et al., 2023).
2. Subtitles for the Focus Agent’s speech: Participants suggested that the speech of the Focus Agent might be too long for them to be able to comprehend a question in its entirety. In this case, subtitles would be a useful help.
3. Time schedule of Focus Agent: the Time allocation was a pre-determined time duration. However, the time should be allocated according to the participants’ engagement in the current discussion. In this case, the Focus Agent should make dynamic time allocations based on the flow of the discussion.
## 8. Limitation and Future Work
Our investigation underscores several limitations that pave the way for future research directions.
First, the current iteration of the Focus Agent is limited to text-based interactions, differing significantly from the multi-modal nature of human moderation. Human moderators use non-verbal cues and physical context to tailor their approach, which text-only agents cannot replicate. This limitation is particularly challenging in settings involving tactile or visual elements. However, advancements in sophisticated LLMs like GPT-4, which understand multi-modal data (OpenAI, 2023), could evolve the Focus Agent into a more versatile, multi-modal platform that closely simulates human discussions.
Second, our study centres on LLM application within focus groups, overlooking broader quantitative and qualitative research methodologies. Prior studies have used LLMs to generate reviews or comments (Liang et al., 2023; Chuang et al., 2023), noting that LLM-generated opinions may lack human creativity (Bender et al., 2021). Ensuring the validity of these insights requires extensive empirical validation.
Lastly, our analysis highlights the difficulties LLMs face in multi-participant discussions. While there is some research on one-on-one dialogues and all-AI discussions (Abbasiantaeb et al., 2024), studies on mixed human-AI communication in group settings are scarce. The ability of LLMs to engage in multi-human conversations is crucial for advancing human-AI interaction. Future research should explore how human participants adjust their communication strategies in the presence of AI, aiming to optimise these interactions for better collaborative outcomes.
## 9. Conclusion
Our research introduced the Focus Agent, a novel AI simulation system developed to simulate focus group discussions through the dialogue of AI agents. This system aims to gather insights akin to those derived from traditional focus groups, leveraging the capabilities of AI participants to generate discussions on designated topics. To assess the degree of alignment between the viewpoints expressed by AI and human participants, we ran a user study that employed an AI moderator to facilitate discussions among human participants. Our analysis uncovered that the Focus Agent includes opinions that similar to those of human participants. Additionally, we studied human participants’ perceptions of the AI moderator and found that while the AI could fulfil the functional role of a moderator, there remained some differences in the interaction experience compared to engagement with human moderators. We examined the underlying reasons and identified specific areas within the large language model’s capabilities that require further enhancement.
## References
- (1)
- Abbasiantaeb et al. (2024) Zahra Abbasiantaeb, Yifei Yuan, Evangelos Kanoulas, and Mohammad Aliannejadi. 2024. Let the llms talk: Simulating human-to-human conversational qa via zero-shot llm-to-llm interactions. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining. 8–17.
- Bain et al. (2023) Max Bain, Jaesung Huh, Tengda Han, and Andrew Zisserman. 2023. WhisperX: Time-Accurate Speech Transcription of Long-Form Audio. INTERSPEECH 2023 (2023).
- Bender et al. (2021) Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big?. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency. 610–623.
- Biedermann (2018) Narelle Biedermann. 2018. The use of Facebook for virtual asynchronous focus groups in qualitative research. Contemporary nurse 54, 1 (2018), 26–34.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
- Brüggen and Willems (2009) Elisabeth Brüggen and Pieter Willems. 2009. A critical comparison of offline focus groups, online focus groups and e-Delphi. International Journal of Market Research 51, 3 (2009), 1–15.
- Carletta (2006) Jean Carletta. 2006. Announcing the AMI meeting corpus. The ELRA Newsletter 11, 1 (2006), 3–5.
- Chen and Neo (2019) Julienne Chen and Pearlyn Neo. 2019. Texting the waters: An assessment of focus groups conducted via the WhatsApp smartphone messaging application. Methodological Innovations 12, 3 (2019), 2059799119884276.
- Chuang et al. (2023) Yun-Shiuan Chuang, Agam Goyal, Nikunj Harlalka, Siddharth Suresh, Robert Hawkins, Sijia Yang, Dhavan Shah, Junjie Hu, and Timothy T Rogers. 2023. Simulating Opinion Dynamics with Networks of LLM-based Agents. arXiv preprint arXiv:2311.09618 (2023).
- Ciaccio (2023) Edward J Ciaccio. 2023. Use of artificial intelligence in scientific paper writing. , 101253 pages.
- Contributors (2023) OpenCompass Contributors. 2023. OpenCompass: A Universal Evaluation Platform for Foundation Models. https://github.com/InternLM/OpenCompass.
- Daniels et al. (2019) Nicola Daniels, Patricia Gillen, Karen Casson, and Iseult Wilson. 2019. STEER: Factors to consider when designing online focus groups using audiovisual technology in health research. International Journal of Qualitative Methods 18 (2019), 1609406919885786.
- Desplanques et al. (2020) Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck. 2020. Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification. arXiv preprint arXiv:2005.07143 (2020).
- Dong et al. (2023) Xin Luna Dong, Seungwhan Moon, Yifan Ethan Xu, Kshitiz Malik, and Zhou Yu. 2023. Towards next-generation intelligent assistants leveraging llm techniques. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5792–5793.
- Gerling et al. (2020) Kathrin Gerling, Patrick Dickinson, Kieran Hicks, Liam Mason, Adalberto L Simeone, and Katta Spiel. 2020. Virtual reality games for people using wheelchairs. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–11.
- Gordon et al. (2021) Allegra R Gordon, Jerel P Calzo, Rose Eiduson, Kendall Sharp, Scout Silverstein, Ethan Lopez, Katharine Thomson, and Sari L Reisner. 2021. Asynchronous online focus groups for health research: case study and lessons learned. International journal of qualitative methods 20 (2021), 1609406921990489.
- Gratton and O’Donnell (2011) Marie-France Gratton and Susan O’Donnell. 2011. Communication technologies for focus groups with remote communities: a case study of research with First Nations in Canada. Qualitative Research 11, 2 (2011), 159–175.
- Guest et al. (2017) Greg Guest, Emily Namey, and Kevin McKenna. 2017. How many focus groups are enough? Building an evidence base for nonprobability sample sizes. Field methods 29, 1 (2017), 3–22.
- Haupt and Marks (2023) Claudia E Haupt and Mason Marks. 2023. AI-generated medical advice—GPT and beyond. Jama 329, 16 (2023), 1349–1350.
- Horiguchi et al. (2020) Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, and Kenji Nagamatsu. 2020. End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors. arXiv preprint arXiv:2005.09921 (2020).
- Horiguchi et al. (2021) Shota Horiguchi, Paola Garcia, Yusuke Fujita, Shinji Watanabe, and Kenji Nagamatsu. 2021. End-to-end speaker diarization as post-processing. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7188–7192.
- Jokinen and McTear (2022) Kristina Jokinen and Michael McTear. 2022. Spoken dialogue systems. Springer Nature.
- Katar et al. (2023) Oğuzhan Katar, Dilek ÖZKAN, Özal YILDIRIM, U Rajendra Acharya, et al. 2023. Evaluation of GPT-3 AI language model in research paper writing. Turkish Journal of Science and Technology 18, 2 (2023), 311–318.
- Katz et al. (2024) Daniel Martin Katz, Michael James Bommarito, Shang Gao, and Pablo Arredondo. 2024. Gpt-4 passes the bar exam. Philosophical Transactions of the Royal Society A 382, 2270 (2024), 20230254.
- Keen et al. (2022) Sam Keen, Martha Lomeli-Rodriguez, and Helene Joffe. 2022. From challenge to opportunity: virtual qualitative research during COVID-19 and beyond. International Journal of Qualitative Methods 21 (2022), 16094069221105075.
- Kitzinger (1994) Jenny Kitzinger. 1994. The methodology of focus groups: the importance of interaction between research participants. Sociology of health & illness 16, 1 (1994), 103–121.
- Kitzinger (1995) Jenny Kitzinger. 1995. Qualitative research: introducing focus groups. Bmj 311, 7000 (1995), 299–302.
- Koubaa (2023) Anis Koubaa. 2023. GPT-4 vs. GPT-3.5: A concise showdown. (2023).
- Liang et al. (2023) Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Smith, Yian Yin, et al. 2023. Can large language models provide useful feedback on research papers? A large-scale empirical analysis. arXiv preprint arXiv:2310.01783 (2023).
- Mazza (2006) Riccardo Mazza. 2006. Evaluating information visualization applications with focus groups: the CourseVis experience. In Proceedings of the 2006 AVI workshop on BEyond time and errors: novel evaluation methods for information visualization. 1–6.
- Medennikov et al. (2020) Ivan Medennikov, Maxim Korenevsky, Tatiana Prisyach, Yuri Khokhlov, Mariya Korenevskaya, Ivan Sorokin, Tatiana Timofeeva, Anton Mitrofanov, Andrei Andrusenko, Ivan Podluzhny, et al. 2020. Target-speaker voice activity detection: a novel approach for multi-speaker diarization in a dinner party scenario. arXiv preprint arXiv:2005.07272 (2020).
- Nagle and Williams (2013) Barry Nagle and Nichelle Williams. 2013. Methodology brief: Introduction to focus groups. Center for Assessment, Planning and Accountability 1-12 (2013).
- Nay et al. (2024) John J Nay, David Karamardian, Sarah B Lawsky, Wenting Tao, Meghana Bhat, Raghav Jain, Aaron Travis Lee, Jonathan H Choi, and Jungo Kasai. 2024. Large language models as tax attorneys: a case study in legal capabilities emergence. Philosophical Transactions of the Royal Society A 382, 2270 (2024), 20230159.
- Nicholas et al. (2010) David B Nicholas, Lucy Lach, Gillian King, Marjorie Scott, Katherine Boydell, Bonita J Sawatzky, Joe Reisman, Erika Schippel, and Nancy L Young. 2010. Contrasting internet and face-to-face focus groups for children with chronic health conditions: Outcomes and participant experiences. International Journal of Qualitative Methods 9, 1 (2010), 105–121.
- OpenAI (2023) R OpenAI. 2023. GPT-4 technical report. arXiv (2023), 2303–08774.
- Park et al. (2023) Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442 (2023).
- Radford et al. (2022) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv:2212.04356 [eess.AS]
- Reynolds and McDonell (2021) Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems. 1–7.
- Richard et al. (2021) Brendan Richard, Stephen A Sivo, Robert C Ford, Jamie Murphy, David N Boote, Eleanor Witta, and Marissa Orlowski. 2021. A guide to conducting online focus groups via Reddit. International journal of qualitative methods 20 (2021), 16094069211012217.
- Rosenbaum et al. (2002) Stephanie Rosenbaum, Gilbert Cockton, Kara Coyne, Michael Muller, and Thyra Rauch. 2002. Focus groups in HCI: wealth of information or waste of resources?. In CHI’02 extended abstracts on human factors in computing systems. 702–703.
- Rough and Cowan (2020) Daniel Rough and Benjamin Cowan. 2020. Don’t Believe The Hype! White Lies of Conversational User Interface Creation Tools. In Proceedings of the 2nd Conference on Conversational User Interfaces. 1–3.
- Sallam (2023) Malik Sallam. 2023. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. In Healthcare, Vol. 11. MDPI, 887.
- Selter et al. (2023) Felicitas Selter, Kirsten Persson, Peter Kunzmann, and Gerald Neitzke. 2023. End-of-life decisions: A focus group study with German health professionals from human and veterinary medicine. Frontiers in Veterinary Science 10 (2023), 1044561.
- Shanahan et al. (2023) Murray Shanahan, Kyle McDonell, and Laria Reynolds. 2023. Role play with large language models. Nature 623, 7987 (2023), 493–498.
- Stalmeijer et al. (2014) Renée E Stalmeijer, Nancy McNaughton, and Walther NKA Van Mook. 2014. Using focus groups in medical education research: AMEE Guide No. 91. Medical teacher 36, 11 (2014), 923–939.
- Stewart and Shamdasani (2017) David W Stewart and Prem Shamdasani. 2017. Online focus groups. Journal of Advertising 46, 1 (2017), 48–60.
- Stewart and Shamdasani (2014) David W Stewart and Prem N Shamdasani. 2014. Focus groups: Theory and practice. Vol. 20. Sage publications.
- Talebirad and Nadiri (2023) Yashar Talebirad and Amirhossein Nadiri. 2023. Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314 (2023).
- Taubenfeld et al. (2024) Amir Taubenfeld, Yaniv Dover, Roi Reichart, and Ariel Goldstein. 2024. Systematic biases in LLM simulations of debates. arXiv preprint arXiv:2402.04049 (2024).
- Troshani et al. (2021) Indrit Troshani, Sally Rao Hill, Claire Sherman, and Damien Arthur. 2021. Do we trust in AI? Role of anthropomorphism and intelligence. Journal of Computer Information Systems 61, 5 (2021), 481–491.
- Turney and Pocknee (2005) Lyn Turney and Catherine Pocknee. 2005. Virtual focus groups: New frontiers in research. International Journal of Qualitative Methods 4, 2 (2005), 32–43.
- Veloso (2020) Braian Veloso. 2020. WHATSAPP COMO FERRAMENTA PARA A ORGANIZAÇÃO DE GRUPOS FOCAIS ONLINE NA PESQUISA DA EDUCAÇÃO: UM RELATO DE EXPERIÊNCIA. In Anais do CIET: EnPED: 2020-(Congresso Internacional de Educação e Tecnologias— Encontro de Pesquisadores em Educação a Distância).
- Wang et al. (2023a) Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2023a. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432 (2023).
- Wang et al. (2023b) Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. 2023b. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966 (2023).
- Wenzek et al. (2019) Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2019. CCNet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359 (2019).
- Wilkerson et al. (2014) J Michael Wilkerson, Alex Iantaffi, Jeremy A Grey, Walter O Bockting, and BR Simon Rosser. 2014. Recommendations for internet-based qualitative health research with hard-to-reach populations. Qualitative health research 24, 4 (2014), 561–574.
- Wirtz et al. (2019) Andrea L Wirtz, Erin E Cooney, Aeysha Chaudhry, Sari L Reisner, and American Cohort To Study HIV Acquisition Among Transgender Women. 2019. Computer-mediated communication to facilitate synchronous online focus group discussions: feasibility study for qualitative HIV research among transgender women across the United States. Journal of medical Internet research 21, 3 (2019), e12569.
- Xiong et al. (2016) Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig. 2016. Achieving human parity in conversational speech recognition. arXiv preprint arXiv:1610.05256 (2016).
- Xu et al. (2023) Yuzhuang Xu, Shuo Wang, Peng Li, Fuwen Luo, Xiaolong Wang, Weidong Liu, and Yang Liu. 2023. Exploring large language models for communication games: An empirical study on werewolf. arXiv preprint arXiv:2309.04658 (2023).
- Xu et al. (2024) Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. 2024. Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817 (2024).
- Yan et al. (2024) Hanqi Yan, Qinglin Zhu, Xinyu Wang, Lin Gui, and Yulan He. 2024. Mirror: A Multiple-perspective Self-Reflection Method for Knowledge-rich Reasoning. arXiv preprint arXiv:2402.14963 (2024).
- Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023).
- Zhu et al. (2023) Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Zhicheng Dou, and Ji-Rong Wen. 2023. Large language models for information retrieval: A survey. arXiv preprint arXiv:2308.07107 (2023).