Abstract
Background:
People with Alzheimer’s disease (AD) often demonstrate difficulties in discourse production. Referential communication tasks (RCTs) are used to examine a speaker’s capability to select and verbally code the characteristics of an object in interactive conversation.
Objective:
In this study, we used contextualized word representations from natural language processing (NLP) to evaluate how well RCTs can distinguish between people with AD and cognitively healthy older adults.
Methods:
We adapted machine learning techniques to analyze manually transcribed speech from an RCT completed by 28 older adults, including 12 with AD and 16 cognitively healthy older adults. Two approaches were applied to classify these transcripts: 1) using clinically relevant linguistic features, and 2) using machine-learned representations derived from a state-of-the-art pretrained NLP transfer learning model, a classification model based on Bidirectional Encoder Representations from Transformers (BERT).
Results:
The results demonstrated the superior performance of AD detection using a designed transfer learning NLP algorithm. Moreover, the analysis showed that transcripts of a single image yielded high accuracies in AD detection.
Conclusion:
The results indicated that RCT may be useful as a diagnostic tool for AD, and that the task can be simplified to a subset of images without significant sacrifice to diagnostic accuracy, which can make RCT an easier and more practical tool for AD diagnosis. The results also demonstrate the potential of RCT as a tool to better understand cognitive deficits from the perspective of discourse production in people with AD.
INTRODUCTION
Alzheimer’s disease (AD), an age-related neurodegenerative disease, is the most commonly diagnosed form of dementia [1]. Previous studies have shown that structural and functional changes emerge gradually from the early phases of mild cognitive impairment (MCI) due to AD, and even from preclinical phases [2, 3]. Impairments caused by AD gradually worsen learning, episodic memory, and other cognitive functions such as language, executive functioning, and visuospatial skills. Consequently, these cognitive impairments may lead to loss of the ability to perform basic activities of daily living, thereby severely degrading patients’ quality of life. Although cognitive interventions can curtail the decline in AD patients’ cognitive functioning, and in some cases even improve it, such interventions are most effective when implemented early in the course of the disease [4, 5]. For this reason, it is necessary to develop a sensitive assessment tool that can detect cognitive decline early enough for timely treatment.
Language deficits may occur in early stages of AD and MCI [6]. Analysis of patients’ language changes is particularly useful for early identification of cognitive and linguistic changes [7]. Therefore, many language assessing tasks were designed to characterize a language profile in individuals with AD based on specific linguistic variables, such as syntactic complexity [8, 9], lexical content [10, 11], verbal fluency [12, 13], and semantic [14] or discourse aspects of spoken language [15]. However, previous studies have shown conflicting results on their diagnostic utility likely due to the inherent heterogeneities in the cognitive and linguistic deficits of people with AD, as well as differing experimental methodologies including experimental setting and variables selected to measure language impairments across studies [16, 17]. In particular, previous studies are largely based on laboratory tasks, which limit ecological validity [18, 19]. For example, a myriad of studies employed semantically heavy, but non-interactive tasks such as picture description and storytelling [16, 20].
A recent attempt to overcome these limitations employed referential communication tasks (RCTs) that enable clinicians to examine interlocutors’ language in a more natural and interactive setting [21]. RCTs place especially heavy demands on pragmatic and social skills that are ecologically valid and generalizable to everyday life. RCTs have been widely used to investigate speakers’ ability to produce referential expressions according to different communicative settings (e.g., the targeted referent’s accessibility, partner’s perspective) [22–31]. For example, in a traditional RCT, conversational partners collaborate to solve a task that is based on verbal communication alone. The partners are given a set of cards with abstract figures in a different order (see Fig. 1 for example). One of the conversational partners describes the figures with the goal of enabling the other partner to sort the cards into a common order. This is an interactive task repeated across trials to allow the conversational partners to converge on common understandings of the abstract figures. As such, it simulates aspects of natural conversation in everyday life and the changing properties of the spoken communication with time can be used as a measure of the extent to which conversational partners developed shared knowledge and used it while communicating.

Example stimuli during sorting (left) and testing (right) phases.
Recent research using RCTs has provided crucial insights into the underlying mechanisms that drive potential differences in language use between people with AD and healthy adults. Individuals with AD often fail to produce as much information as needed. They produced less efficient referring expressions (e.g., ambiguous pronouns [34]) and provided irrelevant or even wrong information, which could lead to the partner’s misunderstanding [32]. They were also less likely to adapt their language according to the partner’s perspective [35] or feedback [32]. This linguistic performance of individuals with AD, which correlates with general cognitive functioning [35], can be challenging for conversational partners and often leads to communication breakdown. Further, individuals with AD have difficulty establishing shared information with a conversation partner [33]. For example, Feyereisen et al. studied how two partners achieve common understanding through shared experience. The results showed that while people with AD benefited from task repetitions, they did not adjust their language with respect to the shared understanding; they used fewer definite referential expressions and were less likely to produce shared language. Thus, RCTs, especially those adopting interactive conversational settings, have been widely used and have demonstrated their utility in research that examines language performance in individuals with AD and healthy adults.
Although RCTs are gaining in popularity, more research is needed to establish them as effective tools to differentiate people with AD from cognitively healthy adults. Moreover, a chief concern is that they can be time-consuming to administer, and there is no evidence for the number of tasks or trials that are required for sufficient diagnostic sensitivity and specificity. To address these issues, we apply Natural Language Processing (NLP) and machine learning (ML) techniques to help us better understand the feasibility of using RCTs to aid in AD diagnosis. Information collected from the results will provide new insights into maximizing the effectiveness of RCTs as a diagnostic tool.
Natural language processing and machine learning in AD diagnosis
NLP is a research field in computer science which uses computational power to learn, understand, and produce language content [34]. Prior research has shown that NLP and ML techniques using various features from speech transcripts and their associated audio files can evaluate AD and differentiate people with AD from healthy controls [35–42]. Reported classification accuracy was good, ranging between 81% and 85%, but left room for improvement.
One approach to improving the performance of classification models is to explore AD-associated variables in language samples. Fraser et al. applied a machine learning classifier with linguistic variables from the transcripts and acoustic variables from the associated audio files in the DementiaBank corpus to distinguish between participants with AD and healthy controls. They obtained a classification accuracy of over 81% [35]. Searle et al. used the same corpus and compared transcript-level and utterance-level features extracted from deep-learning transformer-based models as inputs to different ML classification models. They reported that features from a term frequency-inverse document frequency (TF-IDF) vectorizer, a transformation that weights each token by its importance to the document, reached the highest accuracy, 81%, when used as input to a support vector machine [36].
Another approach is to refine the ML classification model so that it better distinguishes feature differences between the two groups. Karlekar et al. compared three neural models based on Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Recurrent Neural Network architectures and their combinations, using language features extracted from participants’ manual transcripts collected in the Boston cookie theft description task to diagnose AD. The combined CNN-LSTM model achieved the highest accuracy, 84.9% [41]. In another study, Fritsch et al. used the same dataset and fed features derived from N-gram language models [37] into a neural network language model with LSTM cells. This model achieved 85.6% classification accuracy at the equal-error rate [38].
The methods described above, however, may be of limited use when applied to RCT transcripts. This is because the strength of RCTs as a diagnostic measure lies in part in its sensitivity to discourse characteristics, and it is therefore necessary that NLP or ML approaches to automatic classification be able to measure such adjustments when they are made (or not made) by patients. Consequently, NLP and ML approaches cannot rely on basic metrics such as word count or mean length of utterance but must instead access conceptual information present in the transcripts.
With recent advances in ML, NLP techniques have come to rely on word-level representations that capture semantics, that is, the conceptual information associated with word entities [43]. Most such models are trained on word associations in either a left-to-right or right-to-left context. However, one state-of-the-art pre-trained NLP model, the Bidirectional Encoder Representations from Transformers (BERT) model, learns bidirectional contextual relations between words or sub-words in the input transcript. In contrast to BERT, earlier transfer learning models that scan left-to-right or right-to-left can represent each token using only the previously encountered tokens, not the subsequently encountered ones (i.e., those occurring after the current word). In token-level analysis, unidirectional approaches can substantially reduce model performance because the discourse semantics of a word depends on the whole context, which includes words occurring both before and after it. BERT overcomes the limitation of unidirectional text order by framing the training task as predicting randomly selected and masked words in a text from the surrounding context. In this way, BERT can provide contextualized word representations in the extracted feature vectors [44]. By pre-training on a large text corpus as a language model, BERT can encode a context-sensitive embedding for each word in a given sentence.
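The masked-word objective described above can be illustrated with a minimal sketch. It only shows how training inputs and prediction targets might be built. The helper name `mask_tokens` and the example sentence are hypothetical, and the 15% masking rate follows the original BERT paper rather than anything stated in this article. The sketch is also simplified, because BERT additionally replaces some selected tokens with random words or leaves them unchanged.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, targets); targets maps position -> original token.

    Simplified: every selected token is replaced by [MASK].
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_to_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]   # the word the model must recover
        masked[pos] = MASK           # hidden from the model's input
    return masked, targets

# Hypothetical description of an abstract figure, split on whitespace.
tokens = "it looks like a cell with a tail on the left side".split()
masked, targets = mask_tokens(tokens)
```

Because the words on both sides of each `[MASK]` remain visible, a model trained to fill in `targets` has to use left and right context together, which is the bidirectional property the text refers to.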
Compared to other NLP transformers, BERT has been observed to be superior in detecting AD from transcripts of speech. Novikova et al. used two methods to extract language features from participants’ manual transcripts collected in the Boston cookie theft description task to diagnose AD: features derived from domain knowledge and features derived from BERT. The domain-knowledge features included those commonly used in AD diagnosis, namely lexico-syntactic, acoustic, and semantic features. The two types of features were used as inputs to the same classification model to compare classification performance. The fine-tuned BERT model achieved the highest accuracy in the classification task [45]. Because it provides contextualized word representations, BERT is expected to extract more discriminative features from RCT transcripts than those obtained by linguistic analysis. Therefore, BERT has the potential to be an efficient tool for evaluating discourse during RCTs.
In the present research, we aimed to develop a methodological approach of combining RCT with NLP to differentiate people with AD and cognitively healthy older adults based on cognitive and linguistic deficits observed in their referential communication, and to further develop the RCT process to maximize its performance for the purpose of AD detection. We predicted that using a measure of the participants’ ability to adjust referential expressions in RCT as input of NLP can potentially have outstanding performance in AD detection.
METHODS
Dataset
In this study, we employed a pre-existing dataset of transcripts from an RCT experiment [48]. The dataset comprised manual transcripts from 12 older adults with mild-to-moderate AD and 16 cognitively healthy older adults [1]. The AD participants were diagnosed by their neurologists and met the criteria in McKhann et al. [1]. They were evaluated with the Mini-Mental State Examination-2 (MMSE-2) to ensure that they demonstrated mild-to-moderate dementia severity. Biomarkers were not collected for the participants since the study was not concerned with the pathophysiology of AD. The current study protocol was reviewed and approved by the Institutional Review Board of the University of Tennessee Health Science Center. Demographic information is shown in Table 1. Education level and gender were not significantly different between the two groups, t = 1.32, p > 0.05, and t = –0.35, p > 0.05, respectively. MMSE-2 score and age were significantly different between the two groups, t = 3.24, p < 0.05, and t = –3.05, p < 0.05, respectively. Healthy participants had higher MMSE-2 scores and were younger than individuals with AD.
Demographic information (mean and standard deviation) for each group
MMSE-2, Mini-Mental State Examination Second Edition; DSRS, Dementia Severity Rating Scale.
The RCT experiment procedure
The experiment consisted of two sets of an RCT, each of which included a sorting phase and a testing phase (Fig. 2). Each set included a different set of images.

Schematic illustration of the experimental set up.
In the sorting phase, the participant was given a set of 12 abstract image cards shuffled into a random order, and the experimenter was given a booklet with the same 12 images in a certain order. Before the task began, the experimenter described the goal (to sort the image cards into the instructed order) and the process (to sequentially pick the next image card based on the description) to the participant. During the sorting phase, the experimenter described each of the 12 images to the participant and the participant rearranged the image cards accordingly. During the task, participants were encouraged to ask questions if necessary. As shown in Fig. 3, a barrier was set up between the participant and the experimenter to ensure that they could not see each other’s image cards, but they could see each other’s faces, allowing them to make eye contact and see each other’s facial expressions. The sorting task (rearranging all 12 image cards in a particular order) was repeated for at least four rounds (each round aimed for a different order). If participants made errors, they repeated the task for up to nine rounds, until they sorted the images without errors in two consecutive rounds. Across repetitions, participants learned labels for each of the 12 abstract images (e.g., “the cell” would be the label for the sample image in the top left corner of Fig. 1, left).

A barrier is placed between the experimenter and the participant such that they cannot see each other’s image cards but can make eye contact.
After the sorting phase, two experimenters were involved in the testing phase: One was the same experimenter who participated in the sorting phase (knowledgeable partner), and the other one was new and had no experience with any image labels (naïve partner). The two experimenters (knowledgeable and naïve partners) sat in front of the participant to have a conversation. A barrier was used to separate the participant and the two experimenters (see Fig. 3). All three people had a separate booklet and could not see each other’s booklet. On each page of the booklet, four images were presented including three old images that appeared in the sorting phase and one new image. In the participant’s booklet, one target image was highlighted by a black box on each page (see Fig. 1, right). Participants were requested to describe the target image in the box either to the knowledgeable experimenter or to the naïve experimenter. In the experimenter’s booklet, only the four images without the black box were shown. The participant described the target images to either experimenter prompted by the instruction shown on the center of each page (e.g., “Describe the picture in the box to the partner A.”). The appointed experimenter marked the targeted image on their booklet based on participant description and the other experimenter proceeded to the next trial. During the test phase, the experimenter provided minimal feedback when they needed to prompt the partner’s clarification (e.g., “Could you explain this more? Could you clarify it again?”). The experimenters consistently followed the same protocol across all participants and never interrupted the participant’s utterances. The testing phase had 24 trials in a set: 12 trials referred to the familiar images and 12 trials referred to the unfamiliar new images. They completed a total of two sets of sorting and testing (see Paek & Yoon for details [46]).
Transcript encoding and preprocessing
Both full transcripts and initiating reference transcripts during the test phase for each participant were analyzed. The initiating reference transcripts were adopted from the work of Paek and Yoon [46]. All referential expressions included function words and content words, but not filler words (e.g., um, uh) [46].
Initiating reference refers to a participant’s language production before they received any feedback from their partner [47]. It has been well established that the partner’s verbal feedback can dramatically change speaker’s production. Thus, initiating reference is often used to control the effect of the partner’s feedback on speaker’s production [29, 49]. Although full transcripts and initiating references do not make a significant difference in the results in behavioral studies [47], we employed both measures in the current research to examine their effects in our new approach.
Linguistic feature extraction
We used both a traditional feature-based analysis of RCT transcripts and a discourse level approach using BERT to get a complete picture of how RCTs may be used in evaluating AD.
For linguistic feature extraction, 12 features with potentially high relevance to age-related language impairment were computed from each input transcript [50]. Details of the selected features are given in Table 2.
Summary of all linguistic features extracted
Deep representation feature extraction
Because its bidirectional design is well suited to extracting contextual information, BERT was used as an encoder to convert transcripts into high-dimensional feature vectors for AD classification. For computational efficiency in the subsequent AD detection design, the Small-BERT model with 4 hidden layers and a hidden size of 512 was used to extract 512 features for each input transcript [63]. The extracted features were used as input to the classification model, which assigns each input to one of two classes: people with AD or cognitively healthy older adults.
Machine learning architectures
To evaluate the performance of transfer learning in AD detection using RCT, we planned to compare classification performance between two feature types: machine learning features and linguistic features. These two types of features were extracted from the speech transcripts and were then used as input to train a binary classification model for AD diagnosis.
Because image descriptions vary widely across participants, owing both to differences between images and to individual linguistic patterns, the feature extraction strategy was important to classification performance. Therefore, the analysis was carried out with 7 feature extraction approaches. The 7 approaches differed in whether they used transcripts of familiar images, unfamiliar images, or all images (both familiar and unfamiliar), and in whether they used combined or individual features. Familiar and unfamiliar images were those that participants had or had not seen in the sorting phase, respectively.
To examine the potential effect of participants’ linguistic patterns on classification performance, transcripts were grouped in two ways: individual and combined. In the individual type, the transcript of each image description was used separately as input to the classification model. In the combined type, all transcripts for the targeted type of images were combined into a single input. Figure 4 shows the architecture of the 7 feature extraction approaches.

Architectures of the 7 designed feature extraction strategies. Tasks shown in this figure refer to (a): Task1, (b) Task 2, (c) Task 3, (d) Task 4, (e) Task 5, (f) Task 6, (g) Task 7.
The comparison between image types examined whether acquiring novel knowledge from images was necessary to differentiate between people with AD and cognitively healthy people, which was assessed with a pilot evaluation in a previous study [46].
AD versus non-AD classification and evaluations
The classification model contained two dense layers: 1) The first layer included 50 hidden neurons with rectified linear unit (relu) as the activation function. 2) The second layer was the output layer with sigmoid as the activation function.
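As a rough illustration of the two-layer classifier described above, the following NumPy sketch implements only its forward pass: 50 ReLU hidden units followed by a single sigmoid output. The input width of 512 matches the BERT feature vectors mentioned earlier. The weight initialization, the 0.5 decision threshold, and the random stand-in features are assumptions made for illustration, and the training procedure (loss, optimizer, epochs) is not specified in the text and is omitted here.

```python
import numpy as np

def init_params(n_features=512, n_hidden=50, seed=0):
    """Randomly initialized weights (assumed scheme; untrained)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.05, size=(n_features, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.05, size=(n_hidden, 1)),
        "b2": np.zeros(1),
    }

def forward(params, X):
    """Dense(50, ReLU) -> Dense(1, sigmoid); returns one probability per row."""
    hidden = np.maximum(0.0, X @ params["W1"] + params["b1"])  # ReLU
    logits = hidden @ params["W2"] + params["b2"]
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))                 # sigmoid

params = init_params()
# Synthetic stand-in for 28 transcript feature vectors (not study data).
X = np.random.default_rng(1).normal(size=(28, 512))
probs = forward(params, X)
predicted_ad = probs >= 0.5   # assumed threshold for the AD class
n_params = sum(p.size for p in params.values())
```

With 512 inputs this network has 25,701 trainable parameters, which is small by deep learning standards. For the 12 linguistic features, the same structure would simply use `n_features=12`.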
Because the training data were small relative to the sample sizes typically needed for deep learning algorithms, we report the performance of the proposed approaches using a leave-one-out cross-validation (LOOCV) procedure. LOOCV is a type of cross-validation in which the number of folds equals the number of instances in the data set. In each fold, the method uses all other instances as the training set and the selected instance as the only test item. Therefore, each instance is used as test data exactly once.
To evaluate and compare the performance of the 7 feature extraction strategies, three objective measures of test performance were calculated: accuracy, sensitivity, and specificity. Figure 5 shows an example of diagnostic test results and how the three measures are calculated.
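The LOOCV procedure can be sketched in a few lines. The example below is illustrative only. The features are synthetic, and a nearest-centroid rule stands in for the neural classifier so that the example stays short. Both are assumptions and not part of the study. Only the group sizes (12 AD, 16 controls) come from the text.

```python
import numpy as np

def loocv_splits(n_instances):
    """Yield (train_indices, test_index); each instance is held out once."""
    all_idx = np.arange(n_instances)
    for test_idx in range(n_instances):
        yield all_idx[all_idx != test_idx], test_idx

def nearest_centroid_predict(X_train, y_train, x_test):
    """Placeholder classifier (not the paper's network): closer class mean wins."""
    centroids = {c: X_train[y_train == c].mean(axis=0) for c in (0, 1)}
    return min(centroids, key=lambda c: np.linalg.norm(x_test - centroids[c]))

rng = np.random.default_rng(0)
y = np.array([1] * 12 + [0] * 16)                 # 1 = AD, 0 = control
X = rng.normal(size=(28, 8)) + 3.0 * y[:, None]   # synthetic, well-separated features

predictions = np.empty(28, dtype=int)
for train_idx, test_idx in loocv_splits(28):
    predictions[test_idx] = nearest_centroid_predict(
        X[train_idx], y[train_idx], X[test_idx]
    )
accuracy = float((predictions == y).mean())
```

Each of the 28 folds trains on 27 instances and tests on the remaining one, so the pooled predictions cover every instance exactly once.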

Results of a diagnosis test example.
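The three measures follow the standard definitions based on true and false positives and negatives. A minimal sketch is given below, treating AD as the positive class. The function name and the example predictions are hypothetical and are not results from the study.

```python
def diagnostic_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity with 1 = AD (positive), 0 = control."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # AD correctly detected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # AD missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # control correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # control flagged as AD
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }

# Made-up example: 12 AD and 16 control cases, one error in each group.
y_true = [1] * 12 + [0] * 16
y_pred = [1] * 11 + [0] + [0] * 15 + [1]
metrics = diagnostic_metrics(y_true, y_pred)
```

The function assumes both classes are present in `y_true`; otherwise sensitivity or specificity is undefined and the division fails.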
Each task was repeated 20 times to reduce the variability in diagnostic performance across the 7 tasks. The tasks were then compared on accuracy, sensitivity, and specificity.
RESULTS
Table 3 shows the means and standard deviations of the classification results based on LOOCV for the full transcripts and the initiating reference transcripts using BERT features. Overall, accuracy ranged from 90.9% to 99.8% for the full transcripts and from 92.8% to 98.3% for the initiating references. For the full transcripts, the best classification accuracy was 99.8%, obtained in Task 4, which used features extracted from the combined transcripts of all images. For the initiating reference transcripts, the best classification accuracy was 98.3%, also obtained in Task 4, which used features extracted from transcripts of all images.
Mean and standard deviation of classification accuracy (%) results of 7 tasks with full transcripts and initiating reference transcripts using BERT output as features
Acc, accuracy; Sen, sensitivity; Spec, specificity.
Table 4 presents the classification results using linguistic features generated from the full transcripts and the initiating reference transcripts. Tasks 1–4 yielded accuracies approaching 50%. Task 7, which used separate full transcripts for all images as input, achieved the highest accuracy, 96.9%. Among the initiating reference transcripts, Task 5, which used separate transcripts for familiar images as input, achieved the highest accuracy, 95.0%.
Mean and standard deviation of classification accuracy (%) results of 7 tasks with full transcripts and initiating reference transcripts using linguistic features
Acc, accuracy; Sen, sensitivity; Spec, specificity.
Sensitivity analysis
Given the high performance of the tasks using the transfer learning algorithm, we evaluated the contribution of each of the 48 images to identify ways of simplifying the experiment while maintaining relatively high performance. To do so, the classification test was run with individual image transcripts as input. The tests followed the same approach shown in Fig. 4, except that they used transcripts of individual image description tasks.
Table 5 shows the classification results for each image with initiating reference transcripts. Unfamiliar images showed higher performance than familiar images. Accuracy ranged from 79.8% to 99.6% across familiar images and from 94.6% to 99.1% across unfamiliar images. Table 6 shows the classification results for each image with full transcripts. Familiar images showed relatively higher performance than unfamiliar images. Accuracy ranged from 93.5% to 99.6% across familiar images and from 92.6% to 99.1% across unfamiliar images.
Mean and standard deviation of classification accuracy (%) results of individual unfamiliar images (top) and familiar images (bottom) using initiating reference transcripts
Mean and Standard deviation of classification accuracy (%) results of individual unfamiliar images (top) and familiar images (bottom) using full transcripts
Analysis controlling the age difference between the groups
Although the control group was significantly younger than the AD group, we hypothesized that the differences in language performance between the two groups were due to cognitive differences rather than age differences. To test this hypothesis, we carried out secondary analyses in which we removed the three youngest participants (ages 65, 68, and 69, all in the control group) and the oldest participant (age 89, in the AD group) from the dataset. Between the modified control group and the modified AD group, participants’ age was not significantly different (t = 2.02, p > 0.05), and their education level and gender remained comparable (t = –1.29, p > 0.05, and t = 0.55, p > 0.05, respectively). Not surprisingly, the MMSE-2 score differed significantly between the two modified groups, t = –2.44, p < 0.05. We repeated the tasks in Fig. 4 for the two modified groups using BERT analysis (Table 7) and linguistic analysis (Table 8). The results in Tables 7 and 8 show that AD can be detected using BERT and linguistic features, and that performance using BERT is better overall than performance using linguistic features, consistent with the results in Tables 3 and 4. We carried out t-tests to compare corresponding tasks in Table 3 versus Table 7. Among the 14 tasks, only two yielded significantly different results between Tables 3 and 7; see the items marked by stars in Table 7. Similarly, we carried out t-tests to compare corresponding tasks in Table 4 versus Table 8. Among the 14 tasks, none had significantly different results between Tables 4 and 8. These secondary analyses increase our confidence that the analyses reveal differences between the control group and the AD group that are due to cognitive deficits.
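The text does not state which variant of the t-test was used. As one possible reading, the sketch below computes an independent two-sample Student's t statistic with pooled variance, using only the Python standard library. The function name and the input values are made up for illustration and are not study data.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Two-sample Student's t with pooled variance (assumed variant).

    Returns (t, degrees_of_freedom).
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    standard_error = sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return (mean(a) - mean(b)) / standard_error, na + nb - 2

# Made-up values standing in for two sets of repeated-run scores.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 3.0, 4.0, 5.0, 6.0]
t_value, dof = pooled_t_statistic(group_a, group_b)
```

Converting the statistic to a p-value requires the t distribution with `dof` degrees of freedom. A statistics library routine such as `scipy.stats.ttest_ind` returns both values directly.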
Mean and Standard deviation of classification accuracy (%) results of 7 tasks with full transcripts and initiating reference transcripts using BERT output as features extracted from the updated dataset
Acc, accuracy; Sen, sensitivity; Spec, specificity. Results in Table 7 are not significantly different from those in Table 3, except for items marked by *.
Mean and Standard deviation of classification accuracy (%) results of 7 tasks with full transcripts and initiating reference transcripts using linguistic features extracted from the updated dataset
Acc, accuracy; Sen, sensitivity; Spec, specificity. *Significance at p < 0.05 in the t-test. No results in Table 8 are significantly different from those in Table 4.
DISCUSSION
This study aimed to investigate the use of RCTs to develop an AD diagnosis system based on the language transcription data collected in Paek & Yoon [46]. We used both a traditional feature-based ML approach as well as the more discourse-based approach of BERT. Overall, the main results showed that RCTs can be used effectively to aid in AD diagnosis. High-dimensional feature vectors extracted via BERT achieved superior performance compared to results from analyses of linguistic features alone and this result suggests that using BERT to aid in analysis of RCT transcripts may be a promising approach. Moreover, it may be possible to simplify the RCT experiment by reducing the number of images used in diagnosing AD since the performance of classification derived from using transcripts of individual image descriptions with BERT encoder reached an accuracy similar to that of classification derived from transcripts of all images (N = 48).
BERT is not a unique technology in the diagnosis of AD. Previous studies have already utilized BERT as the tool for differentiating between people with AD and normal control group by using their transcripts in different tasks [64]. However, the present study demonstrated superior performance in the diagnosis of AD than other earlier studies. The most significant difference in the current study than others is that our experimental task involved a more complex and higher level of social goals and interactions by using RCTs to elicit utterances from people with AD. We hypothesized that, comparing with other connected speech experiments (i.e., Boston Cookie Theft Task), the proposed RCT exposed higher dimensions of deficit features through the pragmatic language from people with AD.
In the current study, basic linguistic skills were examined via the analysis of linguistic features. Comparing Tables 3 4 reveals that the classification performance using BERT analysis is better than that using linguistic features alone. This finding indicates that BERT analysis is more likely to reveal the discourse level impairment since BERT analysis extracts not only linguistic features but also semantic representations of the transcripts [65]. Early language deterioration in AD has been hypothesized as a consequence of deficits in semantic memory rather than global cognitive decline [66]. Lexical semantic memory is considered as essential for many aspects of cognition, which includes reasoning, planning, and social interaction [67]. Moreover, prior studies revealed significant implications of the role of social interaction as a factor in facilitating new semantic learning in individuals with memory impairment [68]. In contrast to other connected speech experiments, the designed RCT emphasized the structure of communication and social interaction skills by encouraging the participants to describe the target image so that the conversational partners understand and identify the targeted one. We postulated that the superior performance of classification is attributed to the interactive communicative settings in RCTs.
Multiple prior studies have demonstrated that deficits in episodic memory capacity occur in the early stage of AD [69, 70]. According to the theory proposed by Mahr and Csibra, episodic memory allows us to communicatively support our interpretations of the past by gauging when we can assert epistemic authority [71]. Under this theory, the recollection generated from episodic memory helps the speaker communicate the reasons for holding certain beliefs about the past. In the designed RCT, the familiar images and shared language reactivated the episodic memory record. When describing the familiar images, the speaker was required to access their episodic experience of the image shown in the sorting phase to produce an appropriate description (i.e., shared language). Since people with AD primarily suffer from episodic memory disorder, the disease was expected to affect these mentalizing abilities, so that classification performance would differ when analyzing individual unfamiliar images. However, the results show similar performance for individual familiar and unfamiliar images. Therefore, we inferred that episodic memory may not be the dominant factor behind the superior performance.
We also found a wider spread of accuracy when using initiating-reference transcripts from individual familiar image descriptions than when using full transcripts from individual familiar image descriptions, with mean (standard deviation) accuracies of 93.0% (4.7%) and 96.9% (1.3%), respectively. Because the initiating reference is defined as the discourse produced before the experimenter's intervention, this difference possibly arose because participants may use shared language with the experimenter before any feedback is given and then adjust their descriptions over the course of the interaction. In this process, the influence of individual image complexity was gradually increased by the dynamics of referential adjustment over the course of the conversation. Further studies are required to investigate the differences between discourse processes in different image description tasks and to examine the influence of the partner's feedback in RCTs.
The cognitive demands of referential communication are complex [72]. In this continuous and interwoven process, multiple cognitive functions (e.g., working memory and attention) that underlie linguistic features affect classification performance. Unlike previously published analyses of RCTs, the present study analyzed not only linguistic features but also semantic representations from the transcripts. As a result, the extracted representations of input sentences were high-dimensional and impractical to isolate. However, given the architecture of BERT and the comparative analysis conducted in the present study, we speculated that the deterioration of semantic representations may be the key to understanding the superior classification performance. To test this speculation, further studies will examine the differences in semantic representations and discuss the cognitive mechanisms of people with AD in the communication process.
A main limitation of the present study is the relatively small sample size, as the participants were recruited from the community. Given the small sample size and the complexity of the evaluation models, this limitation weakens the performance of the diagnosis platform, especially when evaluating individuals with AD, who show considerable variability. One caveat of the current analysis is the possibility of false positive results due to the small sample size. Further research with larger and balanced samples will be necessary to confirm the performance of the proposed diagnosis platform.
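To illustrate why a small sample leaves a classification accuracy estimate uncertain, the following minimal sketch computes a Wilson score interval for a proportion. It is not part of the analysis in this study: the function name wilson_interval and the counts used in the example (27 of 28, and 270 of 280) are hypothetical values chosen only for illustration.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Return the (lower, upper) Wilson score interval for a proportion.

    correct: number of correctly classified samples (hypothetical input)
    total:   number of samples evaluated
    z:       normal quantile; 1.96 corresponds to roughly 95% coverage
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    )
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical counts, not results from this study.
    for correct, total in [(27, 28), (270, 280)]:
        low, high = wilson_interval(correct, total)
        print(f"{correct}/{total}: {low:.3f} to {high:.3f}")
```

With the same observed proportion, the interval for the smaller sample is markedly wider, which is the reason larger and balanced samples are needed to confirm the reported performance.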
CONCLUSION
In this study, we compared the results of AD diagnosis using two sets of features: linguistic features vulnerable to AD, and transfer learning features from a BERT encoder, both derived from transcripts of an RCT experiment. The aim of this study was to examine the application of a transfer learning algorithm to RCTs for the purpose of AD diagnosis. Our results showed that transfer learning models achieved superior performance, which demonstrates their feasibility as a diagnostic tool for people with AD in future studies. In the comparison of the seven designed feature extraction strategies, combining transcripts as the input to the BERT encoder showed strong performance. Moreover, the sensitivity analysis showed that using the transcript of an individual image as input could still maintain relatively high performance in the classification task. This result suggests that fewer images could be used for a more efficient RCT experiment while maintaining high classification performance. Overall, our investigation of transfer learning in RCTs suggests that the RCT has the potential to be an AD evaluation tool with higher accuracy.
Further research is necessary to verify these results with a larger sample size, where the transfer learning model may deliver more insight into the observed performance. Additionally, more people in the early stage of AD, such as MCI, will be recruited in future research to examine the performance of the proposed algorithm for the diagnosis of early AD and to evaluate the feasibility of diagnosing people early enough for early treatment. Further work will also build on these results to develop improved diagnostic tools for disease screening and monitoring in AD, which can further increase the efficiency of early identification and treatment. In the future, we will analyze the differences between the designed abstract images and simplify the experiment by reducing the number of images in order to develop an effective AD screening tool.
DISCLOSURE STATEMENT
Authors’ disclosures available online (https://www.j-alz.com/manuscript-disclosures/21-5137r2).
ACKNOWLEDGMENTS
This work is in part supported by NIH NIA R03 AG072236-01 awarded to EJP and SOY.
