Abstract
As the use of voice assistant (VA) systems increases, conversation design becomes important for effective human-system interaction. The objective of this study was to investigate user preference for VA outputs in terms of their linguistic features. The answers of three VA systems to each of nine questions were collected and categorized by four linguistic factors: type of theme, thematic progression, number of predications, and ellipsis. The VA answers were evaluated through an online survey. Results show that linguistic factors and features significantly affect user preference for VA outputs. The results imply that linguistic features need to be considered when designing voice interaction as a natural interaction method.
Introduction
Conversational Interfaces and Voice Assistant Systems
Conversational interfaces are becoming more popular, and transformative changes are expected in computer interface design guidelines. Conversational interfaces are dialogue-based artificially intelligent systems used for human-computer interaction. Three main types of conversational interfaces are distinguished by their mode of input and output: a) speech-based user interfaces, where the user’s voice is the input and the system’s audio response is the output; b) text-based user interfaces or chatbots, which can only be activated by text, but may also include images, video, point-and-click interaction, etc. (Candello & Pinhanez, 2016); and c) multimodal user interfaces, which incorporate both voice and visual elements (Stanciulescu, 2008). Voice assistant (VA) systems are spoken dialogue interfaces between humans and computational agents. Over the past decade, advances in natural language comprehension (Bellegarda, 2014) have led to commercial products such as Siri, Alexa, and Google Home. Most comparisons and evaluations of VAs are conducted by the companies that produce the assistants, and the main objective of these studies is to introduce and test new skills or to evaluate the capacity of VAs to perform tasks. User satisfaction is measured by the assistants’ capacity to understand the assigned tasks (Clark et al., 2020).
Linguistic Factors and Features in Voice Assistant Systems
Type of the theme
In defining the type of the theme, the guiding principle has been Halliday’s hypothesis of theme and rheme as the fundamental structural pattern in English (Halliday, 1994). The message’s starting point is the theme or topic, which is determined by its position in the utterance. The rheme or comment is the clause’s theme-developing portion. Thematic positions are occupied by the subject (viewed as a semantic category, not only grammatical) in unmarked themes and by other parts of the utterance in marked themes. For example, in the collected VA system responses to a human question in the present study the following theme-rhematic structure is observed:
Q. What’s the weather today in Washington, DC?
A1. It’s currently clear and 18°C in Washington, DC (Unmarked)
A2. The forecast is around 15°C and clear. Right now it’s 18°C and clear. (Combination of marked and unmarked themes)
A3. Right now in Washington, United States, it’s 66°F with clear skies. Tonight you can expect mostly clear skies with a low of 54 degrees (Marked)
Thematic progression
Thematization, the development of the theme throughout the conversation, is related to thematic progression. Thematic progression is how speakers connect topics over larger stretches of discourse (Daneš, 1974). The main types of progression are constant progression, where a theme that is not new derives from the previous theme(s), and linear progression, where the theme derives from the previous rheme (Daneš, 1974). Other types of progression refer to themes that cannot be related to earlier material and therefore show no progression. These include new themes (Herriman, 2011) and grammatical themes (McCabe, 1999), such as the expletives ‘it’ and ‘there’, which act as a grammatical subject but have no referent. The following answers to a question illustrate these types.
Q. Who won the last Champions League Championship?
A1. Manchester City were defeated by Chelsea 0 to 1 in the Champions League finals on May 29, 2021. (Other - new theme)
A2. The current champion of the Champions League is Chelsea. (Linear)
A3. According to an answers contributor: Real Madrid have won the most Champions League titles, with the club winning the trophy a staggering 13 times. (Other - new theme)
Predication
Predication is the relation between a semantic subject and a predicate, through which something is asserted or denied of the subject (Searle, 1969). Primary predication is the relationship between the sentence’s grammatical subject and predicate, expressed by finite verbs. Secondary predication occurs at the phrase level between non-finite verb forms and nominals. Secondary predication compresses information into a sentence. It overloads the semantics while decreasing the number of language means used to construct a proposition (Sargsyan, 2017). Therefore, for the purposes of this study, it is more appropriate to measure the complexity of the utterances produced by the VAs by the number of primary and secondary predications they contain rather than the number of language means used to produce those utterances. Examples are presented below.
Q. What’s the shortest route from New Mexico to Nevada?
A1. Getting directions. If you don’t see them, tap the notification on your iPhone when it’s safe to do so.
A2. Alright, the best way to get from New Mexico to Nevada by car is via I-40 W. It will take about 15 hours and 42 minutes in light traffic.
A3. Getting directions from Santa Fe to Carson City.
While A3’s response stands out as the least complex and the shortest, it is not immediately clear how A1’s and A2’s responses should be treated in this regard. If complexity were measured by the number of language means used, A1’s response would appear less complex and shorter than A2’s. However, analysis of the semantic structure of the two utterances shows that A1 contains 5 predications (4 primary and 1 secondary), while the semantic relations in A2’s response are expressed through 3 predications (2 primary and 1 secondary).
Ellipsis
Ellipsis is the removal of a word or group of words without changing the meaning. In an elliptical structure, a word or words are absent but may be easily recovered, and the context makes the meaning evident. Ellipsis occurs to avoid repeating a word or phrase from the previous chunk of speech, which is implied by the inner or outer context of the conversation (Halliday & Hasan, 1976). Consequently, ellipsis is another means of text compression in English (Sargsyan, 2017). For example,
Q. Set a timer for cooking fish for 40 minutes, please.
A1. Cooking fish timer. 40 minutes. Starting now. To hear this timer, please make sure you’ve turned off Do not disturb and Silent mode. (Elliptical)
A2. 40 minutes, counting down. (Elliptical)
A3. Alright, a 40-minute timer called cooking fish. And we’re starting now. (Non-elliptical)
This distinction serves as an additional criterion to interpret the received ratings in terms of the respondents’ preferences in favor of longer non-elliptical or shorter elliptical responses by VAs.
Objective of the study
Many studies have compared the leading commercial VAs in popularity, market sales, skills, and the tasks they complete for users (Cowan et al., 2017). Most of the limited studies on VAs’ conversational abilities compare intent-matching capacity by response accuracy (López et al., 2017) and task completion rate (Oviatt et al., 2008). Several studies evaluate one VA without comparing it to others (Pyae & Joelsson, 2018). Few studies have tested the naturalness of these popular VAs (Berdasco et al., 2019). No official conversation design principles exist, and no thorough VA studies have been done, to inform conversation designers about the language forms and discursive structure of speech output (Clark et al., 2019). In particular, little research addresses how the linguistic features of the voice output affect the user experience. The current study therefore evaluated the perceived likability of the utterances produced by three top-selling VAs in response to posed queries, as conditioned by the type of theme, the thematic progression, the number of predications explicitly or implicitly contained in the utterance, and the presence or absence of ellipsis in the response. The objective of the study was to evaluate user preference for voice outputs from VAs using linguistic factors, in order to generate VA conversation design guidelines.
Methods
A survey was conducted to collect users’ preferences for VA outputs. In order to present VA outputs in the survey, recorded audio files of VA output answers for questions were prepared and included in the survey.
Questions and Answers from VA
VA outputs from commercial VA products were collected and prepared for the online survey. We posed the same questions to the three top-selling VAs and recorded the answers from each system. The answers were then transcribed to prepare the voice files used in the survey. To avoid biases from differences in the systems’ voices, the responses were re-recorded in the same machine voice using a TTS (text-to-speech) system. The collected VA outputs were also analyzed in terms of the four linguistic factors described in the previous section, along with their associated features (the levels of each factor). Table 1 presents sample questions asked to the VAs, the answers to the questions from the three systems (A, G, and S), and the linguistic factor and feature classification. The VAs’ responses were presented anonymously so as not to expose the brand of the VA.
Table 1. Answers from voice assistant systems for sample questions and linguistic factors.
Survey Design
An online survey form was developed including instructions, informed consent, ratings of VA outputs, and demographic questions. In the rating session, the recorded VA outputs were presented for 9 questions. Thus, a total of 27 rating questions were presented, consisting of 3 VA answers for each of the nine questions posed to the VAs. Participants were asked to rate their level of preference for each VA output after listening to the recorded audio file of the answer to the given question, using a 7-point Likert scale (1: dislike to 7: like). Figure 1 shows a screenshot of the survey form including a question to the VA systems (e.g., "what’s the shortest route from New Mexico to Nevada" at the top of the figure), play buttons for each VA answer, and the rating scales. The identity of the VA system behind each answer was not shown, and the order of answers was randomized to avoid bias.

Figure 1. Screenshot of the online survey form.
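The rating layout described above (nine questions, three anonymized answers each, randomized answer order) can be sketched as follows. This is a minimal illustration, not the survey platform's implementation: the question labels, the per-participant seed, and the choice to shuffle answers within each question are all assumptions, since the text states only that the order of answers was randomized.

```python
import random

QUESTIONS = [f"Q{i}" for i in range(1, 10)]  # nine questions posed to the VAs
SYSTEMS = ["A", "G", "S"]                    # anonymized system labels


def build_rating_items(seed):
    """Return (question, system) rating items, with the three answers
    shuffled independently within each question (assumed scheme)."""
    rng = random.Random(seed)  # hypothetical per-participant seed
    items = []
    for question in QUESTIONS:
        order = SYSTEMS[:]
        rng.shuffle(order)
        items.extend((question, system) for system in order)
    return items


items = build_rating_items(seed=1)
print(len(items))  # 27 rating items: 9 questions x 3 answers
```

Seeding the generator per participant keeps each presented order reproducible, which helps when mapping ratings back to the system that produced each answer.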
Survey Response Collection
The online survey was disseminated across the U.S. after obtaining IRB exemption approval from the University of Michigan (HUM00202203). The survey invitation and instructions stated that only adults over 18 years old could participate. To encourage participation, respondents were entered into a raffle for a $10 gift card. A total of 553 responses were collected. Of these, 213 responses were retained as the final dataset after filtering out invalid data such as incomplete responses and careless responses (e.g., giving the same rating to every item or rating without listening to the audio files).
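The screening rules named above can be expressed as a simple filter. The sketch below is illustrative only: the field names `ratings` and `played` are hypothetical, and treating any unanswered item as "incomplete" is an assumption, since the study does not define its exact criteria.

```python
N_ITEMS = 27  # 9 questions x 3 answers


def is_valid(response):
    """Apply the three screening rules to one survey response."""
    ratings = response["ratings"]  # hypothetical field: 1-7, or None if unanswered
    played = response["played"]    # hypothetical field: True if the audio was played
    if len(ratings) != N_ITEMS or any(r is None for r in ratings):
        return False  # incomplete response (assumed definition)
    if len(set(ratings)) == 1:
        return False  # same rating given to every item
    if not all(played):
        return False  # at least one item rated without listening
    return True


# Made-up responses, one per rule, for illustration only.
responses = [
    {"ratings": [4, 5, 3] * 9, "played": [True] * 27},
    {"ratings": [4] * 27, "played": [True] * 27},
    {"ratings": [4, 5, 3] * 9, "played": [True] * 26 + [False]},
    {"ratings": [4, 5, None] * 9, "played": [True] * 27},
]
valid = [r for r in responses if is_valid(r)]
print(len(valid))  # 1
```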
Results and Discussion
Effect of Systems and Questions on the Rating
Analysis of variance (ANOVA) revealed significant main effects of Question (F(8, 5566)=16.75, p<0.001) and System (F(2, 5566)=12.62, p<0.001), as well as a significant Question × System interaction (F(16, 5566)=15.70, p<0.001). Post-hoc analysis at an alpha level of 0.05 revealed that systems G (M=4.57) and S (M=4.48) were preferred to system A (M=4.35) across questions. Figure 2 shows the interaction plot of the ratings for the three systems and nine questions. As shown in Figure 2, while ratings are generally higher for G and S and lower for A (for example, in Q3 and Q8), some notable interactions can be observed. In Q2, system G yields higher ratings than systems A and S. In Q7, the rating of system A is higher than those of systems S and G. These observations may be explained by the linguistic features of each answer.

Figure 2. Interaction plot of questions and systems.
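The degrees of freedom reported for the Question × System ANOVA imply how many individual ratings entered the analysis. The arithmetic below assumes a fixed-effects model with an intercept and no separate participant term; that model form is an assumption, as the text does not describe it.

```python
def implied_n(residual_df, effect_dfs):
    """Observations implied by a fixed-effects ANOVA with an intercept:
    N = residual df + sum of effect df + 1."""
    return residual_df + sum(effect_dfs) + 1


# Question (8), System (2), Question x System (16); residual df = 5566.
n_analyzed = implied_n(5566, [8, 2, 16])
n_full = 213 * 27  # 213 retained respondents x 27 rating items

print(n_analyzed, n_full, n_full - n_analyzed)  # 5593 5751 158
```

Under this assumption the analysis used 5593 ratings, 158 fewer than a full 213 × 27 grid. The text does not state the reason for the difference.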
Effects of Linguistic Features on the Ratings
ANOVA results also demonstrated significant effects of all four linguistic factors on preference ratings: type of theme (F(2, 5582)=9.11, p<0.001), thematic progression (F(3, 5582)=41.19, p<0.001), number of predications (F(4, 5582)=6.38, p<0.001), and ellipsis (F(1, 5582)=17.16, p=0.008). Quantification theory type I, which applies linear regression to categorical predictors (Florio & Jones, 2021), was employed to assess the contribution of the four linguistic factors and of their features (levels) to the ratings. Table 2 shows the partial correlations of the four factors with overall ratings, which indicate the size of each factor’s contribution to the preference ratings. The table also shows the quantified values of each level of the linguistic features, which were calculated using the coefficients and center values of each factor.
Answers from voice assistant systems for sample questions.
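For readers unfamiliar with Quantification theory type I, the sketch below shows one common formulation: ordinary least squares on dummy-coded categorical factors, with each factor's level scores then centered on their frequency-weighted mean. It is a sketch under stated assumptions, not the authors' analysis: the function name, the two-factor design, and all numbers are synthetic, and partial correlations are not computed.

```python
import numpy as np


def quantification_type1(factors, y):
    """factors: dict of factor name -> category label per observation.
    Returns ({factor: {level: centered score}}, grand mean of y)."""
    y = np.asarray(y, dtype=float)
    columns = [np.ones(len(y))]  # intercept column
    layout = []
    for name, labels in factors.items():
        labels = np.asarray(labels)
        levels = sorted(set(labels.tolist()))
        for level in levels[1:]:  # first level is the reference category
            columns.append((labels == level).astype(float))
        layout.append((name, levels, labels))
    beta, *_ = np.linalg.lstsq(np.column_stack(columns), y, rcond=None)

    scores, pos = {}, 1
    for name, levels, labels in layout:
        raw = {levels[0]: 0.0}
        for level in levels[1:]:
            raw[level] = float(beta[pos])
            pos += 1
        # Center so the frequency-weighted mean score of the factor is zero.
        center = sum(raw[lv] * np.mean(labels == lv) for lv in levels)
        scores[name] = {lv: raw[lv] - center for lv in levels}
    return scores, float(y.mean())


# Synthetic, noise-free ratings on a balanced 2 x 3 design (not study data).
ellipsis_effect = {"no": 0.2, "yes": -0.2}
progression_effect = {"constant": 0.3, "linear": 0.0, "other": -0.3}
ellipsis, progression, ratings = [], [], []
for e, e_val in ellipsis_effect.items():
    for p, p_val in progression_effect.items():
        ellipsis.append(e)
        progression.append(p)
        ratings.append(4.0 + e_val + p_val)

scores, grand_mean = quantification_type1(
    {"ellipsis": ellipsis, "progression": progression}, ratings
)
```

With centered scores, a predicted rating is the grand mean plus one score per factor, so a positive score marks a level associated with above-average preference and a negative score marks the opposite.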
As shown in Table 2, “Thematic Progression” makes the largest contribution to the preference rating among the four factors, followed by “Number of Predications,” “Types of Theme,” and “Elliptical”. Among the features of “Thematic Progression”, answers with constant progression may improve preference, while answers with a combination of progression types may degrade it. This implies that users need to be certain that the VA has understood the query accurately. With constant progression, the VA repeats the theme of the user’s query and thereby reassures users that it has understood the command correctly.
Similarly, users prefer answers with a combination of themes (in Type of Theme), a moderate number (around 3-5) of predications, and non-elliptical constructions. The preference for an unmarked theme followed by a marked theme suggests that dynamically unfolding information is associated with higher user satisfaction: once the topic of the dialogue (signaled by the unmarked theme) has been successfully set, users expect the VA to provide contextually relevant new pieces of information (signaled by the marked theme). Likewise, higher ratings for responses with a moderate number of predications and non-elliptical constructions highlight the need for accuracy and the elimination of possible ambiguity in interaction with computational agents: syntactically complete linguistic forms with fewer context-dependent implications result in more positive user sentiment.
Conclusion
The current study examined, through an empirical study, how linguistic features in answers from VAs are related to users’ subjective preferences. The results revealed that linguistic factors and features significantly affect preferences, and these findings can be used to design voice outputs that improve usability, including user satisfaction.
However, the study has several caveats. First, the quality of information and its relevance to the question were not evaluated, since the contents of the answers used in the study were similar across the three VAs. Second, there may be other linguistic factors and/or features affecting user preferences that were not considered in the study. Third, users’ individual differences such as age, gender, environment, mental model, etc. were not accounted for in the data analysis. In addition, the physical features of VA outputs such as tone, volume, and speed may affect user preferences. Lastly, the conversations were too brief, and a proper exchange of questions and responses is needed. In line with the Gricean maxims of quantity and manner, longer conversations with established context may give the use of ellipsis more significance. Future evaluations should therefore use well-structured conversations that maintain a comprehensive and coherent flow of information.
Further studies are needed to overcome the limitations of the current study and to generate design guidelines for effective voice interaction in VAs. The current study can serve as a starting point for designing VAs with linguistics in mind, as VAs become increasingly important interaction devices in daily life.
