A Case Study: Testing of an Artificial Intelligence Chatbot Designed for Guiding and Assisting Participants at a Biobank Research Site

Abstract

This study investigated if improving a chatbot using user test feedback can enhance the perceived value of a chatbot for a specific biobanking use case. The chatbot was designed to guide participants in a research study around a biobank site to facilitate their data collection experience by conversing with them about what to do and where to go and answering any questions about the measurements, the research, or the biobank. A total of 32 test participants tested the chatbot and provided feedback about their experience, and the answers were analyzed to evaluate perceived value. The first group of users tested the initial version of the chatbot, and the second group tested the improved version of the chatbot. The results demonstrated that by including user feedback to design the chatbot, the perceived value in the second testing increased. Following these early-stage study results, it was determined that the chatbot was ready for deployment in the biobank.

Keywords

biobank AI chatbot guidance assistance data management information technology

Introduction

While there is increasing evidence that self-measurements for biobank data collection can improve participant empowerment and data quality,¹ there is concern regarding the sufficiency of support during autonomous data collection. A biobank in the Netherlands that collects physical and lifestyle data from a large cohort of volunteers recently implemented a self-measurement for body composition on site, among other measurements. While the aim of introducing this self-measurement was to improve the efficiency of the site and reduce staff burden, in practice, the participants experienced confusion when performing the self-measurement and still relied on staff for guidance. With the rise of large language models, chatbots have demonstrated humanlike performance as assistants in healthcare.² Aevai Health B.V., a start-up in the Netherlands, developed a chatbot (Alva) to guide participants during two biobank site visits (one of which includes self-measurements such as weight and height, and the second includes blood tests performed by a doctor’s assistant) and to improve participant experience by answering questions about the biobank, the research, and the measurements. However, the perceived value of such chatbots as assistants largely depends on their ability to meet participant needs and expectations. Aevai Health B.V. leveraged user feedback to improve the chatbot before on-site deployment.

This early-stage case study investigates whether systematically improving the chatbot based on user feedback could enhance its perceived value as a guiding tool for biobank data collection visits.

Materials and Methods

Chatbot

In this case study, a chatbot (Alva) was offered to participants as a web app interface with an instruction video, accessed via a quick response (QR) code. Alva used a structured conversation to guide participants to different stations at the site and to perform measurements. Alva also included the technology “Retrieval Augmented Generation” and used privacy-preserving open-source, large language models—customized with a knowledge base to provide answers accurately about the biobank and the measurements. The first version of Alva and the improved version of Alva were developed and tested during this case study, referred to as V1 and V2, respectively. In Figure 1, the functionalities of Alva are shown via the different web app pages.

FIG. 1.

(A) Alva (V1 and V2) web app home page showing the introduction video. (B) Alva (V2) AI-based assistance (user prompted a question). (C) Alva (V2) guidance around the site, including to the self-measurement room. (D) Alva (V2) assistance with a possible error during the body composition self-measurement by providing a few sample questions or the option for the user to type a question.

Testing sessions and data collected

Two user testing sessions were conducted prior to the deployment of Alva at the biobank site. Both testing sessions involved an anonymous simple survey of several questions, including open text fields, six of which are included in this analysis. The transcripts of the participants with Alva were collected but were not included in this analysis. Alva can guide the participants through two different visit types, and the tests involved both visits 1 and 2. For the first testing session, Alva V1 was tested by staff from the biobank and external participants, promoting diversity (n = 20). To ensure thorough initial testing of Alva V1, at least two participants were required to watch the instructional video, and at least two participants were asked to follow a specific conversational flow (based on their initial health status). The rest of the pool of participants was free to choose whether they watched the instructional video or what conversational flow they followed. During the second testing session, Alva V2 was tested by new, external participants (n = 12), and no restrictions on conversational flow were implemented. There was no overlap between the participants in the two groups.

Participant population

The participant population chosen for both of the testing rounds included participants with ages between 26 and 65 years old, and it was primarily a Dutch population (which was representative of the actual biobank participant population). For the first and second testing sessions, 36% and 8%, respectively, of the participant population had no prior experience with AI chatbots.

IRB statement

This study was reviewed and deemed ethically acceptable by the biobank and by Aevai Health B.V.; more details can be found in the Supplementary Information.

Data analysis

Median and bootstrapped 95% confidence intervals were calculated to summarize the rating results collected by the surveys and were compared between the two testing sessions. In addition, the distribution of participants that did not experience frustration during chatting with Alva and would use Alva again for guidance at a biobank site was used as a proxy for a perceived value of the chatbot.

Results

The results from six survey questions selected for this analysis and the changes made between Alva V1 and Alva V2 are shown in the tables as follows. For Alva V1, the median rating for the first four questions and examples of the feedback provided are shown in Table 1, and the distribution of results from the last two questions is shown in Table 2. In Table 3, the changes that were made based on user feedback to improve Alva V1 to become Alva V2 are provided. For Alva V2, the median rating for the first four questions and examples of feedback provided are shown in Table 2, and the distribution of results from the last two questions is shown in Table 4. Additional technical changes were made to reach Alva V2 (see the Supplementary Information Table S1).

Table 1.

The Median Rating (with 1 as the Lowest and 10 as the Highest Score) and Bootstrapped 95% Confidence Intervals on Survey Feedback Questions from Test 1 Participants (n = 20) with Alva V1

Question	Median rating	Feedback examples
How accessible was the digital assistant?	9.0 (8.0, 10.0)	None
How easy were the buttons of the digital assistant (back button and send button)?	7.0 (6.0, 8.0)	“Back button was not clear, preventing me from asking next question. When asked the question: do you have more questions: please button yes or no” “When you want to ask an additional question on visit 2, was not clear to click back button.” “I liked the ease of use and user-friendly interface”
How easy was it to understand the guidance of the digital assistant?	7.0 (5.5, 7.0)	“During visit two, no instructions were given, I asked a question and then was suddenly given information for after the blood draw, even though this was not indicated that I had to/had gone there” “A number of sample questions were shown, but I could only click on 1. The rest I had to enter manually.” “I was in the group that was not allowed to do the self-measurement. Then you immediately get told about completing the computer test. when you haven’t started at all.” “Text speed very high. While reading, the text automatically scrolls to the next message while I haven’t finished reading.”
How would you rate the quality of the digital assistants’ answers?	6.5 (6.0, 7.0)	“The standard questions go well, but there are also many standard answers in there. With new questions, I often didn’t get substantive answers. Also contact details of <biobank name> are not known” “I did see some grammatical errors here and there, otherwise all the questions were answered well and clearly. When I asked the question ‘My question is not listed here’ I did get a very long explanation about the whole process, my question was already answered in the second post but I had to wait for the whole story before I could continue with Alva.”

Table 2.

The Distribution on Survey Feedback Questions from Test 1 Participants (n = 20) with Alva V1

Question	Distribution	Feedback quotes
Was there a moment in the interaction with the digital assistant when you felt frustrated?	The number of participants that did not find any part of the interaction with Alva V1 frustrating (nonfrustrated participants in blue and frustrated participants in gray)	“Could not ask a question (the chat bar is not clear)” “Not helped when asked how do I make a new appointment, that would be a nice if it could.” “Going through the process in slightly smaller steps. Which gives you more the questions that belong to the point where you are at this point in the visit.”
Would you use this chatbot again during the guidance of your visit at the biobank?	The number of participants that would use Alva V1 again to guide them during their biobank visit (participants that would use it again in blue and participants that would not use it again in gray)	“The amount of information Alva can give to you, for example, I asked what 24-hour urine is and Alva knew how to explain this very well.” “I tried to ask her about football and she told me she can’t help me with that but with other things.” “I wasn’t super satisfied because it did make me uncertain as well, but also not dissatisfied because I did get more information”

Table 3.

The User-Driven Points of Improvement That Were Made to Obtain Alva V2 Based on the Group 1 Testing Session

Type of feature	User feedback with ALVA V1	Improvements made in ALVA V2
UX/UI	• Participants found the chat area too small and were not likely to use it to type questions. • Participants found the speed of Alva’s responses too fast.	• Chat area was enlarged • Responses from Alva were slowed down.
Conversation design	• Participants indicated that more guidance during the visits was needed, specifically more clarity about where the participant needed to go during the visit. • When at the self-measurement machine, the participant was offered a few sample prompt questions if the participant chooses the option that their question is not one of the sample questions, then Alva would go through all of the steps to perform the self-measurement. This felt unnecessary since the steps were also provided in a poster in the measurement room. • When participants chose a sample prompt question and received the answer, the conversation flow continued to the next measurement. However, they would have liked the option to go back and press on another sample prompt question.	• More chat messages were added in the conversation flow to tell the participant where to go. • The conversation flow was edited such that Alva referred to the poster instead of taking the user through all the self-measurement steps. • When a participant chooses a sample question, they are returned to the list of sample questions rather than just continuing with the conversation.
Knowledge base	• Participants asked relevant questions that Alva could not always answer, for example, about how to make a new appointment for a biobank visit.	• The conversations from the testing were collected, and for cases with incomplete responses from Alva, a staff member’s answer was collected and added to the knowledge base.

Table 4.

The Median Rating (With One as the Lowest and Ten as the Highest Score) and Bootstrapped 95% Confidence Intervals on Survey Feedback Questions from Test 2 Participants (n = 12) with Alva V2

Question	Median rating	Feedback examples
How accessible was the digital assistant?	10 (9, 10)	None
How easy were the buttons of the digital assistant (back button, send button)?	9 (8, 10)	“The back button when disabled should be more in a disabled style” “The look and feel was really simple and intuitive”
How easy was it to understand the guidance of the digital assistant?	9 (8, 10)	“Examples of questions you can ask that actually made me think and encouraged me to ask questions” “Increase speed of responses to questions”
How would you rate the quality of the digital assistants’ answers?	8 (7, 9)	“I asked several questions that Alva answered well.” “The answers seemed natural and understandable, for example, facts about storage time were direct and to the point.” “When I asked about how long my data will be kept, Alva replied as long as possible. When I asked about what was meant by as long as possible, Alva said 30 years. That did shock me, because if it is that long then it should be indicated right the first time I asked.”

Discussion

After the first testing session with Alva V1, participants highly rated the accessibility of the chatbot, and provided mixed but relatively high ratings for the ease of use of the interface, the guidance, and the quality of the responses (Table 1). However, the majority of participants did find the interaction frustrating (Table 2). While the majority of participants indicated that they would use Alva again, they perceived greater value from the assistance offered by Alva rather than from the guidance offered by Alva. Based on this, several improvements were made to the conversation flow and the knowledge base of the chatbot (Table 3): the implementation of these changes is described in the Supplementary Information. In the second test, with Alva V2, the participants provided higher ratings for all of the questions (Table 4), and the majority did not find the interaction with Alva frustrating; all of them would use Alva again (Table 5).The main persistent criticism was regarding the quality of the answers, that Alva can initially be vague and will only provide more details when the participant asks specifically for it. Alva was designed to respond concisely with simple terminology for a lay audience. A future improvement would be to include more personalization; that is, Alva matches the language style of the user after a few prompts. In addition, a relatively small knowledge base was used, which can reduce the accuracy and specificity of the answers. In a recent study to explore postoperative care with a chatbot, the authors found that it could only respond adequately independently 31% of the time to the patient, possibly due to a lack of medical vocabulary understood by the chatbot.³ This emphasizes the importance of the knowledge used by the chatbot to understand and answer user queries.

Table 5.

The Distribution on Survey Feedback Questions from Test 2 Participants (n = 12) with Alva V2

Criteria	Distribution	Feedback examples
Was there a moment in the interaction with the digital assistant when you felt frustrated?	The number of participants that did not find any part of the interaction with Alva V1 frustrating (nonfrustrated participants in blue and frustrated participants in gray)	“That I found the answers a little short and incomplete, or that I saw the back button in the picture that I would just leave out” “Very long reply spread over multiple messages. Would rather have received 1 message with the link for further information than first 4 separate messages and then a message with the link for more information”
Would you use this chatbot again during the guidance of your visit at the biobank?	The number of participants that would use Alva V1 again to guide them during their biobank visit (participants that would use it again in blue and participants that would not use it again in gray)	“Really impressed with the experience with Alva overall.” “It clearly answers a lot of questions.” “The spelling and grammar of the Dutch are not good yet, otherwise I think the answers are sufficient.”

Limitations

We acknowledge several methodological limitations of this study. A small sample size of participants was included; the two groups contained different participants, and their instructions were not identical; and the testing environment represented a simulation of the biobank site (not the site itself). Hence, this testing may not generalize effectively to the real scenario with real participants. Challenges faced in the real setting may include a greater diversity of questions that will be asked to Alva and the possibility that participants may find it complex to use Alva while being tasked to do measurements in a new environment. This could result in a bias regarding the quality of the data collected from the measurements. To investigate this bias when Alva is deployed on the site, it is important that there should be a group of participants that do not use Alva. In this way, the data collected on the measurement between the two groups could be compared for distribution shifts.

Conclusions

This study investigated the effect of optimizing a chatbot for guidance and assistance during biobank data collection on its perceived value. As a result, the Alva V2 chatbot was planned for deployment at two of the biobank’s sites for 5 weeks in each location.

Authors’ Contributions

F.V.D.G. conceived and designed the study, co-developed the study surveys, performed the investigation, conducted data analysis, interpreted the data, drafted the article, critically revised the article, approved the final version, and agreed to be accountable for all aspects of the work. A.M. contributed to study design, codeveloped the study surveys, participated in data analysis and interpretation, critically revised the article for important intellectual content, approved the final version, and agreed to be accountable for all aspects of the work. G.M. developed the chatbot versions evaluated in the study, participated in interpretation of technical and outcome data, critically revised the article, approved the final version, and agreed to be accountable for all aspects of the work.

Footnotes

Acknowledgments

The authors, at Aevai Health B.V., would like to thank the biobank and the external test users for their support during the development of Alva. The authors also thank Naila Loudini for facilitating the initial connection with the journal.

Author Disclosure Statement

The authors disclose their conflict of interest as cofounders of Aevai Health B.V. and regarding the development of the chatbot (Alva) as a commercial product. In addition, F.V.D.G. is employed by Radboud University Medical Centre, A.M. is employed by the University of Groningen, and G.M. is employed by marXact.

Funding Information

This project has been cofunded by the European Union under the Development of EHealth Solutions Improving Resilience in Europe (DESIRE) Open Call under Agreement 2024/080. The rest of the financial contribution is received from Aevai Health B.V. and the biobank.

Supplemental Material

References

1. Veenstra

, Op den Buijs

, Pauws

, et al. Clinical effects of an optimised care program with telehealth in heart failure patients in a community hospital in the Netherlands. Neth Heart J 2015;23(6):334–340; doi: 10.1007/s12471-015-0692-7

2. Cevasco

, Morrison

, Woldeselassie

, et al. Patient engagement with conversational agents in health applications 2016–2022: A systematic review and meta-analysis. J Med Syst 2024;48(1):40; doi: 10.1007/s10916-024-02059-x

3. Dwyer

, Hoit

, Burns

, et al. Use of an artificial intelligence conversational agent (chatbot) for Hip arthroscopy patients following surgery. Arthrosc Sports Med Rehabil 2023;5(2):e495–e505; doi: 10.1016/j.asmr.2023.01.020

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.02 MB

0.00 MB