Abstract
This study presents a structured evaluation of the methodological potential and limitations of using AI-assisted deductive coding in qualitative content analysis, drawing on newly integrated AI coding features in the qualitative data analysis software MAXQDA. Our approach takes a pragmatic stance, positioning AI coding as a promising emerging tool requiring careful methodological consideration for its effective integration into existing workflows. Focusing on six deductive categories derived from Glock’s dimensions of religiosity, as refined by Huber’s Centrality of Religiosity Scale (CRS), we examine how AI-generated codes can support traditional deductive coding practices within narrative interviews about religious turning points. Through a two-phased comparative analysis of 101 cycles of AI coding, we focused first on the inputs to AI-assisted coding, exploring how the complexity of coding instructions impacts AI outputs. Second, we focused on the outputs, assessing AI-generated codes against manually coded segments by examining three distinct aspects: the overall degree to which AI and manual codes match (correspondence), how well the segments align in terms of their textual boundaries (segment scope), and the level of thematic detail they capture (code granularity). Our findings showed that highly complex coding instructions narrowly targeting the central research questions yielded the most stable results. AI-generated codes captured 64% to 66.5% of manually coded segments on average but were typically broader in scope and less fine-grained, often containing multiple codable segments. Additionally, 47.6% of AI-coded segments were found not to align with manual coding. These results support a hybrid approach to AI-assisted coding of qualitative data in which AI-generated codes can serve as a useful reference for human coders, but in which human coders render final judgment in the segmentation and validation of coded segments.
Keywords
Introduction
Recent integration of AI coding into qualitative data analysis software (QDAS) raises new questions about the employment of AI alongside the traditional process of coding qualitative data using QDAS. Beyond AI-assisted open coding of text documents, the most recent innovation in AI coding within QDAS allows for AI-assisted deductive coding, a more targeted type of coding guided by specific coding instructions. This article examines the practical integration of AI-assisted deductive coding through a case study situated in the empirical study of religion. Taking a pragmatic approach, we consider AI coding as a promising emerging tool whose effective adoption into existing workflows requires careful consideration. To support this integration we conduct a structured evaluation of two essential features of AI coding, its inputs—the coding instructions or prompts that guide AI in addressing research questions—and its outputs—the AI-generated codes that constitute the response to such instructions. Our inquiry boils down to two central questions: (1) How should coding instructions in AI-assisted deductive coding be structured to optimize outputs when applying theoretical dimensions of religiosity to narrative interview data within QDAS environments like MAXQDA? and (2) How should AI-generated codes be leveraged to augment the traditional process of deductively coding qualitative data in religious studies using QDAS tools? Guided by our findings, we then discuss the implications for useful incorporation of AI-assisted deductive coding within MAXQDA in the context of theory-driven qualitative research on religiosity.
AI-driven technologies for features such as automated transcription or topic modelling, or even for something as seemingly routine as spell-check, have been integrated into QDAS for a number of years. Perhaps due to their inordinate power, however, the recent implementation of large language models (LLMs) to perform tasks such as document chat, automated summarization and AI-assisted coding has raised new questions about the use of such tools in social research. The questions raised run the gamut from how consolidation of LLM-assisted technology and qualitative research using QDAS affects the independence of researchers (Chubb, 2023; Hayes, 2025) to how it affects the integrity of their data (Davison et al., 2024; Hitch, 2024; Morgan, 2023). Consequently, data security concerns have led some institutions of higher learning to deactivate the use of add-on AI features to QDAS or to disable them from the outset (Schneijderberg, 2024).
In a climate of uncertainty, researchers who have the opportunity to work with AI bear a responsibility to safeguard against any breaches of data protection. This begins with carefully vetting AI service providers for their compliance with relevant data protection laws, for example MAXQDA’s adherence to the General Data Protection Regulation (GDPR) in the European Union (VERBI Software, n.d.). Additionally, it is incumbent upon researchers, as was done in the present study, to rigorously anonymize all data prior to its use with AI-assisted technologies (Hitch, 2024; Kuckartz & Rädiker, 2024; Morgan, 2023). As the adoption of AI in qualitative research becomes more widespread, these ethical responsibilities go hand in hand with the need for clear methodological guidance.
Not surprisingly, recent discussions around the coding of qualitative data have focused heavily on the development of general guidelines for the use of AI-assisted coding (Christou, 2024; Friedman et al., 2024; Hitch, 2024; Kuckartz & Rädiker, 2024; Morgan, 2023; Tai et al., 2024; Williamson et al., 2025). The potential benefits—such as greater cost-effectiveness, accelerated coding processes, and improved inter-coder reliability (Hamilton et al., 2023; Nicmanis & Spurrier, 2025)—coupled with the continued popularity of qualitative methods like content analysis and thematic analysis, in which coding plays a central role, make increased reliance on AI-driven technologies for automated coding highly likely (Christou, 2024). Indeed, AI-assisted coding may well become the first choice for resource-constrained researchers (Williamson et al., 2025), particularly for increasingly large datasets (Chubb, 2023; Hitch, 2024; Nicmanis & Spurrier, 2025).
Beyond data protection, a primary concern has been the reliability of AI in the coding of qualitative data; thus, much of the discussion has revolved around the performance of AI models compared to human coders. In the effort to develop general guidelines for AI-assisted coding, a broad consensus has emerged in favor of a hybrid approach that combines the efficiency of AI with the interpretive expertise and critical judgment of human researchers (Anakok et al., 2025; Christou, 2024; Hayes, 2025; Hitch, 2024; Kuckartz & Rädiker, 2024; Morgan, 2023; Tai et al., 2024; Williamson et al., 2025). A recurring theme in this body of work is the understanding of AI not as a replacement, but as a tool to augment human analysis. Morgan (2023), while acknowledging that ChatGPT challenges the dominance of manual coding, stresses that such tools must be used properly to yield meaningful results. Kuckartz and Rädiker (2024) echo this sentiment, arguing that far from rendering social science methods obsolete, AI forces researchers to sharpen their questions and engage more rigorously with their methods in order to generate meaningful and usable results. Similarly, Christou (2024) describes AI as a complementary aid to human analytical inquiry, and Anakok et al. (2025) affirm that a hybrid approach shows the most promise, especially in light of persistent algorithmic biases.
Several researchers underscore the necessity of human oversight. Nicmanis and Spurrier (2025), despite confirming the feasibility of using LLMs for deductive coding, urge researchers to “thoughtfully integrate AI-assisted data analysis methods with their research values” (p. 12). Hitch (2024) echoes this point, advocating for AI-assisted reflexive thematic analysis only when applied “in a mindful and critically aware manner” (p. 604). Tai et al. (2024) emphasize that LLMs may streamline qualitative analysis but insist that interpretive authority must remain with researchers. Some scholars also raise concerns about the risk of AI being applied beyond its appropriate scope by researchers. Williamson et al. (2025), for example, support the use of AI as a confirmatory aid in coding, but caution against replacing independent human coders. They argue that human analysis must remain the “check and balance” to the accelerating pace and appeal of AI-driven methods (p. 11). Yet other research extends this caution, suggesting that AI-assisted coding may not be suitable for every research domain, for example disability research, where Friedman et al. (2024) found that LLMs lacked the quality, credibility and rigor of human coding.
Previous research on deductive coding performed by the web-based AI option ChatGPT includes Morgan (2023) and Tai et al. (2024), both of which compare human coding against AI coding in pre-defined categorization tasks. Such studies have consistently demonstrated AI’s superior efficiency in terms of speed; however, certain findings suggest that its ability to identify subtle interpretative themes, nuanced meanings, and emotional subtleties is less developed compared to its strengths in recognizing concrete descriptive themes (Christou, 2024; Morgan, 2023). While Morgan (2023) discusses the potential use of LLMs in QDAS, neither his study nor the other comparative studies referenced in this paragraph above employed LLMs embedded within QDAS; instead, all relied on web-based AI tools for their comparative analyses. This distinction is important, as QDAS-embedded LLMs—even when based on the same underlying models—may employ standardized back-end or system prompts to ensure consistency. Such prompts can shape how AI interprets and processes data and make key operational decisions before any user request is even processed.
One study that explicitly engages with QDAS-based implementation is Williamson et al. (2025), who examine the use of LLMs within the QDA software ATLAS.ti to establish a practical framework for integrating AI-generated coding with human analysis. However, their study is limited to ATLAS.ti’s more inductive “AI Coding” feature, as their perspectives on the newer, more deductive “Intentional AI Coding” feature are still under development. Moreover, despite this distinct use of QDAS, Williamson et al.’s implementation of AI-assisted coding, like the other studies discussed above, relies on ChatGPT as the underlying model. Although QDAS platforms may shape how LLMs are implemented, such as through standardized back-end prompts, the reliance on ChatGPT across these studies reflects a notable degree of methodological uniformity. A noteworthy exception is Friedman et al. (2024), who utilize Google’s Gemini alongside ChatGPT to generalize their findings about the limitations of LLMs in qualitative research.
Against this backdrop, the present study examines a different configuration of AI-assisted coding. Specifically, it explores AI-assisted deductive coding within the QDA software MAXQDA, which currently utilizes LLMs through AWS Bedrock, employing Claude Haiku and Claude Sonnet from Anthropic for summarizing, chatting, and coding (VERBI Software, n.d.). Using a theory-driven framework derived from the Centrality of Religiosity Scale (CRS), we offer a structured evaluation of the two essential features of this process, its inputs (coding instructions) and its outputs (AI-generated codes), as assessed in the context of narrative interview data on religious turning points. To achieve our objective, we will first provide some background on qualitative data analysis, coding and AI-assisted coding in QDAS, along with some clarification of foundational terms relevant to this study. We will then present our methodology, including the context of our data, our epistemological positioning, a clarification of specific methodological terminology used throughout, and our analytical procedure (including statistical methods used for evaluating AI-human coding correspondence). Finally, we present our findings and discuss their implications for the use of AI-assisted deductive coding, particularly within religious studies and similar domains.
Background: AI-Assisted Coding in QDAS
Qualitative Data Analysis and Coding
Before introducing AI coding in MAXQDA, we will give a brief overview of qualitative data analysis and clarify some fundamental terminology. In what follows, a “code” is a tag or label assigned to part of a text (referred to below as a “text segment”) reflecting a particular theme or attribute of that part of the text. A “coding category” is a classification for the thematic grouping of text segments that have been assigned a code (referred to below as “coded segments”). As Kuckartz and Rädiker (2024) write, “The focus of qualitative analysis is on categories, which are used to code all the material relevant to the research question(s). Categories can be formed deductively, inductively or deductively-inductively” (p. 39, translated by the authors). Deductive coding is thus distinguished in that text segments are assigned codes and sorted into coding categories based on a preconceived framework or theory, while inductive coding proceeds without a framework to fit codes into, only developing coding categories in the process of coding. A deductive-inductive approach, on the other hand, combines elements from both deductive and inductive coding. One example of such a hybrid approach, as employed in the research project from which the data for the present study derive, gives in vivo code names to segments that were previously assigned to deductively conceived categories and, in turn, inductively builds sub- and superordinate categories within each deductively conceived category.
In the traditional process of deductive coding, coders rely on a pre-conceived set of guidelines or rules referred to here as a codebook: This document will contain specific instructions for how each coding category is to be coded. Coders systematically mark text segments and assign them to their relevant coding categories. This process, sometimes likened metaphorically to bees in the field gathering honey and bringing it back to their hive, is laborious and time consuming; yet, such fragmenting of the text is seen as an indispensable step toward the organization of data in content analysis, thematic analysis and other forms of qualitative analysis (Braun & Clarke, 2006; Mayring, 2019).
Introducing AI-Assisted Deductive Coding in MAXQDA
QDAS is purpose-built to provide a range of tools to help systematize the coding process, making it both more manageable and consistent. AI features are a further complement to this already robust analytical environment, offering additional support in handling and interpreting qualitative data. While a more detailed walkthrough of both traditional (non-AI) and AI-assisted deductive coding in MAXQDA is available in Appendix A, we focus here on the broader methodological potential that AI coding introduces for researchers. In MAXQDA, the features of AI Assist include paraphrasing, summarizing, document or coded segment chat, suggestion of both thematic and interpretative codes for specific text selections, suggestion of subcodes, explanation of text selections, and, as of August 27th, 2024, the beta version of its AI Coding feature that is the focus of this article. 1 AI Coding is described as “a dedicated research assistant” that, if provided with coding instructions defining the coding parameters, assists in “carefully identifying relevant text passages and assigning them to specific codes” (VERBI Software, 2024). As mentioned above, MAXQDA currently utilizes LLMs through AWS Bedrock, using Claude Haiku and Claude Sonnet from Anthropic for its AI features.
On the surface, AI coding promises to be a major boon for the qualitative analysis of data in QDAS. If such features function smoothly, an 80-min narrative interview that would ordinarily take up to 10 hours to code manually could be coded in just a few minutes. Even accounting for additional time to manually validate AI codes, the overall coding time could be considerably reduced.
In what follows we elucidate the process of AI-assisted deductive coding using MAXQDA with focus on our two central research questions: (1) How should coding instructions in AI-assisted deductive coding be structured to optimize outputs when applying theoretical dimensions of religiosity to narrative interview data within QDAS environments like MAXQDA? and (2) How should AI-generated codes be leveraged to augment the traditional process of deductively coding qualitative data in religious studies using QDAS tools?
Methods
Data Source
The data for our present analysis stems from the ongoing Swiss National Science Foundation (SNSF) research project “How Does Religiosity Change?” (https://data.snf.ch/grants/grant/205047). The aim is to study individual changes in religiosity, highlighted by “religious turning points,” from a longitudinal perspective using a multidimensional model of religiosity. The study is based on 26 annual survey waves of the Swiss Household Panel (SHP), from 1999 to 2024, covering around 25,000 individuals. In 2021, SHP participants were asked about religious turning points and their willingness to discuss these in interviews. Subsequently, our project team at the University of Bern’s Institute of Empirical Research on Religion conducted 365 narrative interviews, each averaging 80 min in length. The resulting dataset constitutes a rare coupling of statistical data on the multidimensionality of religiosity and qualitative interviews, enabling novel insights into individual religious evolution throughout life and elucidating the impact of life events, gender, and the dimensions of religiosity.
Context and Outlook of Analysis
During qualitative analysis of interviews for the above SNSF study, a book project was conceived to explore the spiritual lifeworlds of the 95 participants in the study who are members of the Reform tradition in Protestant Christianity. Within the confines of this book project, our team began to explore options for incorporating AI Assist into the analysis of these interviews. This process led to a focused methodological demonstration aimed at systematically evaluating AI-supported deductive coding in MAXQDA. The demonstration underpins the analysis presented in this study. As researchers engaged in theory-driven analysis using MAXQDA, we approached this work from a pragmatic epistemological orientation, motivated by a practical interest in understanding how AI tools might be integrated into our established coding workflows. At the same time, we remain attuned to the need for interpretative sensitivity when it comes to complex issues such as religious experiences. It is a central concern that the use of AI tools does not dilute the essential meaning-making processes at the heart of such analysis.
With these considerations in mind, our project team began by converting the codebook originally developed for human coders into a set of coding instructions for AI. As mentioned above, the deductive element of our analysis comprises six coding categories (referred to hereafter as “dimensions” of religiosity): ideology, intellect, experience, private practice, public practice, and consequences of religiosity in everyday life. These dimensions, in turn, derive from the religiosity model developed by Charles Glock (1962), later refined in Stefan Huber’s CRS (Huber & Huber, 2012). For each dimension, the resulting coding instructions detail the following instructional components: thorough guidelines for how to apply the code, examples of text segments that correspond to the code, possible key words, guiding questions for coding, exclusions for what should not be coded, and possible superordinate categories within the code.
Subsequently, we commenced systematic testing of MAXQDA’s AI Coding feature through a two-phased comparative analysis designed to both test the inputs—coding instructions—and the outputs—AI-generated codes—of the AI-supported deductive coding process. The goal was, on the one hand, to refine and improve our coding instructions and thereby to increase the quality and relevance of the AI-generated output, while, on the other hand, analyzing the outputs to assess how effectively AI-assisted deductive coding can be integrated into our existing coding workflows. This gradual process of refinement transpired over the course of 101 cycles of AI coding, the analysis of which forms the basis of the present study.
Comparative Analysis
Sample
The sample in our analysis comprises six interviews (cases A-F) that had previously been manually coded within the core project team. The first selection criterion was the greatest possible variance in relation to the centrality of religiosity, measured using the seven-item version of Huber’s Centrality of Religiosity Scale (CRSi-7). Cases ranged from low (case F; CRSi-7 score = 2.2) through medium (cases B, C and E; CRSi-7 scores = 3.4, 3.3 and 3.6) to high religiosity (cases A and D; CRSi-7 scores = 4.6 and 4.0), reflecting this intended variance (Huber & Huber, 2012). In addition, attention was paid to variance in terms of content by including cases in which alternative forms of religiosity (e.g. Zen meditation) play an important role. These sample cases were utilized for a series of comparative analyses, first with attention to our inputs (to ascertain the responsiveness of AI to different levels of complexity of our coding instructions), and second with attention to AI’s outputs (to measure AI-coded segments against manually coded segments in terms of their scope, granularity and correspondence). While some cases served a more central role in our analysis, others were used for additional verification as supplementary or fallback cases.
Key Methodological Terminology: Complexity, Scope, Granularity and Correspondence
Before we can present the two phases of our study it is necessary to clarify some key terminology that will appear throughout what follows.
Coding Instruction Complexity
In the context of this study, we define “coding instruction complexity” as the number and combination of instructional components provided in the coding instructions given to AI for deductive coding. The highest complexity instructions provide the most support for interpretation by AI (see Figure 1), whereas medium and low complexity instructions provide progressively less of an interpretative framework. In the tests that follow, we have defined four levels of coding instruction complexity (see phase one below for more detail). The aim will be to determine whether high complexity instructions result in more targeted or accurate coding, or whether more streamlined instructions potentially yield better results.
Figure 1. An example of the highest level of coding instruction complexity, in this case for the dimension of ideology.
Segment Scope
To gain a more nuanced understanding of AI’s performance, we differentiate in our study between segment scope and code granularity (see next definition). As defined in this study, any AI-coded segment that overlaps with human coding can be broader, narrower, or identical in scope. Segments are considered broader in scope if they either (a) overlap with multiple manually coded segments, or (b) include additional text beyond what human coders identified within a single segment. Conversely, narrower segments are more tightly defined than their human-coded counterparts—either capturing the essential thematic element in fewer words, or representing only a partial match if key thematic content is missing. Identical segments have precisely the same boundaries as manually coded segments.
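The scope categories defined above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each segment is represented as a pair of character offsets into the transcript; the names `Span`, `overlaps` and `classify_scope` are hypothetical and do not correspond to any MAXQDA feature. The sketch also simplifies the definition: any AI-coded text falling outside the single overlapping manual segment is treated as "broader", whereas in the study partial matches were judged by human reviewers.

```python
from typing import List, Tuple

# Hypothetical representation: a segment is a (start, end) pair of character
# offsets into the transcript, with `end` exclusive.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def classify_scope(ai: Span, manual: List[Span]) -> str:
    """Classify one AI-coded segment relative to the manually coded segments."""
    hits = [m for m in manual if overlaps(ai, m)]
    if not hits:
        return "novel"      # no overlap with manual coding at all
    if len(hits) > 1:
        return "broader"    # rule (a): overlaps several manual segments
    m = hits[0]
    if ai == m:
        return "identical"  # precisely the same boundaries
    if ai[0] < m[0] or ai[1] > m[1]:
        return "broader"    # rule (b): extra text beyond the manual segment
    return "narrower"       # lies inside the manual segment without matching it


manual_segments = [(10, 20), (30, 40)]
print(classify_scope((0, 100), manual_segments))  # overlaps both manual segments
print(classify_scope((12, 18), manual_segments))  # inside the first manual segment
```

An offset-based rule of this kind can only approximate the interpretive judgment involved in deciding whether a narrower segment still captures the essential thematic element.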
Code Granularity
While the previous definition addressed segment scope—that is, how much text an AI-coded segment covers compared to human coding—here we shift focus to “code granularity”, or the level of thematic differentiation within a segment. Specifically, code granularity (or fine-grained coding) refers to the extent to which distinct thematic elements are identified and segmented separately within a given coding category. A more fine-grained coding output would then isolate each distinct thematic element, while a less fine-grained or coarser output would tend to cluster them together.
An example should help drive home the distinction. Consider the following passage coded by AI for the dimension consequences of religiosity in everyday life, which is coded when an interviewee describes rewards or sanctions experienced in everyday life as a result of their religiosity: Yes, prayer is very helpful to me every day. For my husband, my children, my whole family, and for my fellow human beings who have worries. It is also important to me to truly put God first every day, which is not always easy. Because most of the time, my husband and children come first. But I have simply noticed that my life has become simpler since I professed my Christian faith. That it is simply more fulfilling. And yes, that I have more meaning in my life.
While AI correctly identified this passage for the positive relevance of religion in the everyday life of the interviewee, its selection was broader in scope than what was coded by human coders: the first four sentences were not coded in manual coding for this dimension but were instead coded as private practice and ideology. In further contrast, the last three sentences of the same passage were segmented by human coders into three distinct, thematically different modes of the positive relevance of religion in everyday life: (1) “But I have simply noticed that my life has become simpler since I professed my Christian faith.” (2) “That it is simply more fulfilling.” (3) “And yes, that I have more meaning in my life.”
Segment one emphasizes the relief associated with a simpler life, segment two highlights the experience of greater fulfillment, and segment three conveys a heightened sense of meaning. A more fine-grained coding output would then account for such multiple modes of a particular dimension within the same passage and code each mode separately.
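The granularity comparison in this example can be sketched in a few lines. The snippet below uses the passage quoted above as the AI-coded segment and the three manually coded sentences as the human segments. Matching segments by substring containment is an assumption made for illustration only (a real comparison would rely on segment positions), and the function name `contained_manual_segments` is hypothetical.

```python
# The AI-coded segment from the example above.
ai_segment = (
    "Yes, prayer is very helpful to me every day. "
    "For my husband, my children, my whole family, and for my fellow human "
    "beings who have worries. "
    "It is also important to me to truly put God first every day, which is "
    "not always easy. "
    "Because most of the time, my husband and children come first. "
    "But I have simply noticed that my life has become simpler since I "
    "professed my Christian faith. "
    "That it is simply more fulfilling. "
    "And yes, that I have more meaning in my life."
)

# The three segments coded separately by human coders for this dimension.
manual_segments = [
    "But I have simply noticed that my life has become simpler since I "
    "professed my Christian faith.",
    "That it is simply more fulfilling.",
    "And yes, that I have more meaning in my life.",
]


def contained_manual_segments(ai_text, manual_texts):
    """Return the manual segments whose text occurs inside the AI segment.

    Assumption: substring containment stands in for positional overlap.
    """
    return [m for m in manual_texts if m in ai_text]


contained = contained_manual_segments(ai_segment, manual_segments)
# An AI segment holding more than one manual segment is less fine-grained.
is_less_fine_grained = len(contained) > 1
print(len(contained), is_less_fine_grained)
```

Here the AI segment contains all three manual segments, so it counts as less fine-grained than the manual coding of the same passage.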
Correspondence Between AI and Human Coding
In this study, the term “correspondence” refers to the degree of alignment between AI-generated and human-coded segments, measured as the overlap between AI-coded and manually coded segments. This overlap can be partial or exact but can also occur when multiple manually coded segments fall within an AI-coded segment. For example, in the segment just cited above, all three human-coded segments that fall within the longer AI-coded segment would be counted toward a total tally of manually coded segments contained within the segments generated by AI. These counts are then used to calculate the percentage of correspondence between AI and human coding for any given dimension of religiosity. For a precise explanation of how correspondence percentages are calculated, see the next subsection.
Exploratory Trials and Statistical Methods
Before initiating the present study, the project team conducted an exploratory trial run using AI coding on a single case (case A), selected for its status as a theoretically typical narrative of a highly religious Reformed Christian. In this initial test, we observed that AI generated substantially fewer coded segments than human coders across all six dimensions of religiosity. At this early stage, correspondence between AI and human coding was calculated purely numerically—by comparing the number of coded segments generated by AI to the number identified through manual coding. On average, AI coded 58.7% fewer segments than human coders.
To investigate whether this discrepancy could be reduced, we ran 20 additional cycles of AI coding using coding instructions containing extra directives for more detailed, fine-grained coding; however, results remained mixed, 2 and after 20 cycles we reached a saturation point where further testing yielded no significant improvement. This led to two important insights. First, segmentation of longer passages coded by AI that contain multiple codable segments should be done manually or through more targeted additional AI prompting, not through additional language in our coding instructions. Second, and more material to the design of the present study, we concluded that for more meaningful test results, any further testing should rely on a more precise metric for correspondence based on the degree of segmental overlap between AI and manually coded output, not merely on the number of segments produced.
To calculate correspondence between AI and manually coded segments, we used a coverage-based metric focused solely on the extent to which AI-coded segments overlapped with manually coded ones. Specifically, we assessed what proportion of human-coded segments were also captured, fully or partially, within AI-coded segments. Importantly, our calculation does not penalize for AI-generated segments that do not align with manual codes; those new segments are evaluated separately (see next paragraph). For example, in the AI coding cycle shown in Figure 2, 29 segments were manually coded (not shown in the figure). Of these, 19 were captured within AI-coded segments. This results in a correspondence rate of 65.5% for that cycle (19 of 29 segments).
Figure 2. An example of a table showing AI-coded segments and their overlap with manually coded segments (segments not coded during manual coding are highlighted in blue). The box above the table shows a codeline generated by MAXQDA, visualizing the overlap between AI and manual codes.
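The coverage-based metric described above can be sketched as follows. The sketch assumes, as an illustration only, that segments are available as character-offset pairs; the function names are hypothetical and the code is not part of MAXQDA. It counts a manual segment as captured if it overlaps fully or partially with any AI-coded segment, and it ignores AI-coded segments without a manual counterpart.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def correspondence_rate(manual: List[Span], ai: List[Span]) -> float:
    """Share of manual segments captured (fully or partially) by AI segments.

    AI segments that overlap no manual segment do not lower the rate.
    """
    if not manual:
        raise ValueError("at least one manually coded segment is required")
    captured = sum(1 for m in manual if any(overlaps(m, a) for a in ai))
    return captured / len(manual)


# Toy example: two of three manual segments are captured; the AI segment
# at (100, 120) matches nothing and is simply ignored by the metric.
toy_rate = correspondence_rate([(0, 10), (20, 30), (40, 50)],
                               [(5, 25), (100, 120)])

# Worked arithmetic for the cycle described in the text: 19 of 29 segments.
example_percent = round(19 / 29 * 100, 1)
print(toy_rate, example_percent)
```

Because novel AI segments are excluded from the denominator, this is a one-directional coverage measure and should not be read as a symmetric agreement statistic.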
In measuring the correspondence rates between AI and manually coded segments, a satisfactory result would show a correspondence of 60%, a good result 70% and a very good result 80%. These benchmarks follow a precedent set by Landis and Koch (1977), who introduced similarly arbitrary but practically useful thresholds for discussing agreement in categorical data.
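As a small illustration, the benchmarks above can be expressed as a lookup. The sketch assumes that each threshold is an inclusive lower bound; the label "below satisfactory" for rates under 60% is a hypothetical addition, since the text names no category for that range.

```python
def rate_correspondence(percent: float) -> str:
    """Map a correspondence percentage to the benchmark labels in the text.

    Assumption: thresholds are inclusive lower bounds.
    """
    if percent >= 80:
        return "very good"
    if percent >= 70:
        return "good"
    if percent >= 60:
        return "satisfactory"
    return "below satisfactory"  # hypothetical label, not from the study


print(rate_correspondence(65.5))
```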
As noted above, AI-generated segments that do not align with manual codes (“novel segments”) were excluded from our main correspondence metric. However, for the 63 cycles of AI coding in phase two of our study we provide a count of the total number of these segments and an average of how many such segments were generated per coding cycle. Furthermore, to assess their potential relevance, we qualitatively evaluated a random sample of 16 AI coding cycles (25% of the total). Each segment was independently reviewed by a team member and judged as either valid or invalid based on the original codebook. Segments were considered valid if they clearly aligned with the definition outlined in the codebook for a given dimension. For a few particularly ambiguous segments, assessments were discussed by the broader team in colloquium sessions to ensure agreement with our coding standards.
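For readers who wish to reproduce a comparable sampling step, the following sketch draws 16 of 63 cycles at random and computes the share of novel segments judged valid. The seed, the cycle numbering and the example judgments are illustrative assumptions and not data from the study.

```python
import random

TOTAL_CYCLES = 63   # AI coding cycles in phase two
SAMPLE_SIZE = 16    # roughly 25% of the total


def draw_sample(seed: int) -> list:
    """Draw a reproducible random sample of cycle numbers (1..63)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, TOTAL_CYCLES + 1), SAMPLE_SIZE))


def valid_share(judgments: list) -> float:
    """Share of reviewed novel segments judged valid against the codebook."""
    return sum(judgments) / len(judgments)


sampled_cycles = draw_sample(seed=0)  # the seed is an arbitrary choice
# Hypothetical reviewer judgments (True = valid), for illustration only.
illustrative_share = valid_share([True, False, True, True])
print(len(sampled_cycles), illustrative_share)
```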
Phase One: Input Analysis
The first phase of our analysis was to conduct a series of tests to measure the effect that adjustments in our coding instructions (the inputs) have on the AI-generated output. As noted earlier, initial exploratory trials suggested that adding supplementary directives—such as prompts encouraging more detailed coding—produced mixed results when those directives were not directly tied to the coding objective of identifying dimensions of religiosity. This raised a broader question: how should the components of coding instructions be structured to optimize output quality? While it might be anticipated that high complexity instructions, offering more interpretative guidance for AI, would result in more targeted or accurate coding, there is also the risk that too much complexity could overwhelm AI, leading to decreased performance and inconsistencies. These considerations led us to ask: What effect does the overall complexity of a coding instruction have on the quality of the AI-generated output?
To answer this question, we tested four levels of coding-instruction-complexity using coding instructions for two dimensions of religiosity (experience and public practice) on cases A and B, with case A being highly religious and case B of medium religiosity, representing an alternative form of religiosity. The four levels of coding-instruction-complexity we tested, each defined by a distinct combination and number of instructional components, were: (1) high complexity, which included thorough coding guidelines, example quotations, key words, guiding questions for coding, coding exclusions and possible superordinate categories for the code (see Figure 1); (2) medium complexity (no examples), which omitted example quotations; (3) medium complexity (no questions and categories), which omitted guiding questions for coding and possible superordinate categories; and (4) low complexity, which omitted guiding questions for coding, possible superordinate categories and example quotations. We then validated our findings by running further tests on cases A and B using two variations in coding-instruction-complexity for the four remaining dimensions of religiosity (ideology, intellect, private practice and consequences of religiosity in everyday life). Finally, we ran a further test on a third case (case C, of medium religiosity) using the most promising variation of coding-instruction-complexity for all six dimensions of religiosity, for a total of 38 cycles of AI coding in phase one.
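The four complexity levels can be summarized as subsets of the six instructional components. The sketch below is only an illustration of that design; the identifiers are hypothetical names of our own and do not correspond to any MAXQDA setting.

```python
from __future__ import annotations

# The six instructional components of the high-complexity coding instruction.
ALL_COMPONENTS = (
    "coding_guidelines",
    "example_quotations",
    "key_words",
    "guiding_questions",
    "coding_exclusions",
    "superordinate_categories",
)

# Components omitted at each level (level names are illustrative).
OMITTED = {
    "high": set(),
    "medium_no_examples": {"example_quotations"},
    "medium_no_questions_and_categories": {
        "guiding_questions",
        "superordinate_categories",
    },
    "low": {
        "guiding_questions",
        "superordinate_categories",
        "example_quotations",
    },
}


def components_for(level: str) -> list[str]:
    """Return the components retained at the given complexity level."""
    return [c for c in ALL_COMPONENTS if c not in OMITTED[level]]
```

Read this way, the low-complexity instruction retains only the coding guidelines, key words and coding exclusions.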
Phase Two: Output Analysis
We agree with Christou (2024) that one possible method of validating AI-generated output is a comparative analysis alongside manual analysis. Thus, in the following assessment of the output of AI-assisted deductive coding, our focus is on comparing text segments coded by humans and those coded by AI. This comparison considers the correspondence between segments, differences in scope and granularity, as well as the frequency and validity of novel AI-coded segments. Specifically, we address the following questions: (1) What percentage of manually coded segments correspond with AI coded segments? (2) How does the scope of AI-coded segments compare to the scope of the manual codes that overlap with them? (3) What percentage of AI-coded segments are less fine-grained than manual coding—meaning they encompass multiple manually coded segments—and how many manual codes do they contain on average? And (4) How many segments does AI coding generate that were not manually coded and how likely are such new segments to be valid?
To address these questions, we evaluated the results of the comparative analysis conducted while further refining our coding instructions for the six dimensions of religiosity. In total, the results of 63 cycles of AI coding (run until coding saturation was reached) conducted on six interviews (cases A-F) were compared with the results of manual coding. Our evaluation focused on comparing AI-coded segments to manually coded segments in terms of correspondence (that is, how much AI and human coding overlap) and in terms of the precision of the marking of segments (i.e., identical, broader or narrower in scope). In cases where broader AI-coded segments encompassed multiple manually coded segments, we also documented how frequently this occurred and calculated the average number of manually coded segments contained within each such AI-coded segment, as a measure of the reduced granularity of AI output. Additionally, we recorded the number of new segments coded by AI that were not coded manually and evaluated the validity of a sample of these newly coded segments. Figure 2 reproduces the table used to systematically document these metrics for each round of AI coding.
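For readers who wish to reproduce this kind of comparison outside of QDAS, the metrics can be sketched as operations on character-offset intervals. This is a hedged illustration rather than the procedure used in the study, where the comparison was documented by hand. It assumes that any overlap of at least one character counts as correspondence, and that the scope of an AI-coded segment is judged against the longest manual segment it overlaps. Both rules, and all names in the code, are our own assumptions.

```python
from __future__ import annotations

from statistics import mean
from typing import NamedTuple


class Segment(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def overlaps(a: Segment, b: Segment) -> bool:
    """True if the two segments share at least one character."""
    return a.start < b.end and b.start < a.end


def compare_cycle(manual: list[Segment], ai: list[Segment]) -> dict:
    """Compare one cycle of AI coding with manual coding of the same text."""
    matched = [m for m in manual if any(overlaps(m, a) for a in ai)]
    scope = {"broader": 0, "identical": 0, "narrower": 0}
    contained_counts = []  # manual segments inside each coarse AI segment
    novel = 0
    for a in ai:
        hits = [m for m in manual if overlaps(m, a)]
        if not hits:
            novel += 1  # AI segment with no manual counterpart
            continue
        longest = max(m.end - m.start for m in hits)
        if a in hits:  # same boundaries as a manual segment
            scope["identical"] += 1
        elif a.end - a.start > longest:
            scope["broader"] += 1
        else:
            scope["narrower"] += 1
        if len(hits) > 1:  # less fine-grained than manual coding
            contained_counts.append(len(hits))
    return {
        "correspondence_pct": 100 * len(matched) / len(manual) if manual else 0.0,
        "scope": scope,
        "coarse_segments": len(contained_counts),
        "mean_manual_per_coarse": mean(contained_counts) if contained_counts else 0.0,
        "novel_segments": novel,
    }
```

A different overlap threshold or scope rule would shift individual classifications, so any such script should state its rules alongside the reported figures.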
Limitations
A key limitation of the present study is that only limited testing was possible for the categories of ideology and intellect in phase two of our study. Despite numerous attempts, AI-Assist was largely unable to code these dimensions of religiosity at the time of testing. While a few AI-generated codes were produced, most AI-coding requests for these categories timed out after 5 min, resulting in the error message “Currently unavailable.” Manual coding of ideology and intellect consistently yielded the highest number of codes, which suggests that the complexity of these dimensions may have overburdened AI-Assist. This suspicion was confirmed in consultation with MAXQDA’s support team, who expressed confidence that future updates to AI-Assist would address these deficiencies (Support staff at MAXQDA, personal communication, February 5–19, 2025). As a result, the tests in phase two measuring the correspondence between manual and AI coding are incomplete for these dimensions, with only one result available for intellect and four for ideology (indicated below with an asterisk in each case). This limitation restricts the conclusiveness of our findings for these areas.
It is important to acknowledge an inherent causal ambiguity in testing the efficacy of coding instructions for AI-assisted deductive coding. The efficacy of the coding instructions to elicit the desired result and the ability of AI to generate the desired result are two interrelated factors that are always both at play, where the weight of each factor on the quality of the outcome is not always readily apparent. If an adjustment of coding instructions results in a change in the AI-generated output, it can thus never be definitively said whether that change is attributable to the adjustment of coding instructions or a change in the ability of AI to generate the desired result. This attribution challenge is compounded by the lack of transparency as it relates to the evolution of LLMs and the fact that programs like MAXQDA could change their LLM models to improve the quality of AI generated content at any time (VERBI Software, 2025b). However, MAXQDA strives for consistency and a minimum of variance in AI outputs; this means that across multiple requests made of AI with the same coding instructions, except for minor deviations, the output remains mostly stable (VERBI Software, 2025a). This stability provides some confidence in the reproducibility of our results despite the evolving nature of integrated AI tools.
Another limitation concerns the testing approach during the process of the further refinement of coding instructions. Interview cases and dimensions that show lower correspondence between manual and AI-coded segments are typically tested more frequently than those that perform well from the start. As a result, an overall average of correspondence that does not account for the variation across cases and dimensions may present a skewed picture. To address this, we report not only the overall average correspondence but also averages broken down by interview case and by dimension. Additionally, to mitigate the potential distortion caused by repeated testing of underperforming coding instructions, we also report averages based solely on the final tested version of the coding instructions for each case and dimension.
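The reporting strategy described here can be sketched as follows. The records are invented for illustration and are not study data, and the sketch assumes that averages are unweighted means of per-cycle correspondence rates, which is our reading rather than a documented detail.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (case, dimension, test_number, correspondence_pct)
records = [
    ("A", "experience", 1, 70.0),
    ("A", "experience", 2, 80.0),
    ("B", "experience", 1, 40.0),
    ("B", "experience", 2, 50.0),
    ("B", "experience", 3, 60.0),
    ("A", "public practice", 1, 90.0),
]


def mean_by(rows, key_index):
    """Average correspondence grouped by case (index 0) or dimension (index 1)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[3])
    return {key: mean(values) for key, values in groups.items()}


def final_tests(rows):
    """Keep only the last test for each (case, dimension) pair."""
    latest = {}
    for row in rows:
        key = (row[0], row[1])
        if key not in latest or row[2] > latest[key][2]:
            latest[key] = row
    return list(latest.values())


overall_all = mean(r[3] for r in records)
overall_final = mean(r[3] for r in final_tests(records))
by_case = mean_by(records, 0)
by_dimension = mean_by(records, 1)
```

In the invented data, case B was tested three times on one dimension and pulls the overall average down, while the final-tests-only average counts each case and dimension once.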
Finally, we must acknowledge that this study does not offer a comprehensive assessment of the validity of novel AI-coded segments that do not overlap with manually coded segments. While the review of 16 AI coding cycles (25% of the total) gives a general indication of the number of new segments that can reasonably be considered valid, as well as their proportion within the full dataset, further testing is required to draw more definitive conclusions about the overall validity of such novel AI-coded segments.
Findings
Phase One: Input Analysis
What effect does the overall complexity of coding instructions have on the quality of the AI-generated output?
We measured what percentage of manually coded segments corresponded to AI-coded segments across the dimensions of “experience” and “public practice” on cases A and B using four levels of coding-instruction-complexity. The results showed a gradual reduction of correspondence between manually and AI-coded segments as complexity decreased. While the most complex coding instruction yielded a correspondence of 79.1%, medium complexity (no examples) yielded 74.4%, medium complexity (no questions and categories) 72.5% and low complexity 71%. We then tested high and low coding-instruction-complexity on the remaining four dimensions of religiosity for cases A and B and again measured the percentage of manually coded segments that corresponded to AI-coded segments. While the high complexity coding instructions yielded 73.2%, low coding-instruction-complexity yielded 70%; it should again be noted that these results are incomplete, as the dimension of ideology could not be AI-coded for case A at low coding-instruction-complexity due to a pervasive problem with MAXQDA’s AI Coding (beta) at the time of testing (see Limitations section above). A final test that measured the percentage of manually coded segments corresponding to AI-coded segments across all dimensions for case C, using the most complex coding instructions, yielded a result of 62.6%.
Phase Two: Output Analysis
Over the course of 63 cycles of AI coding, a total of 920 AI-coded segments were generated, of which 482 (52.4%) overlapped with text segments containing manual coding. As indicated above, AI-generated codes are measured against manually coded segments by examining three distinct aspects: the overall degree to which AI and manual codes match (correspondence), how well the segments align in terms of their textual boundaries (segment scope), and the level of thematic detail they capture (code granularity). The 438 (47.6%) AI-generated segments that do not overlap with any manually coded segments are addressed separately below (see question #4).
1. What percentage of manually coded segments correspond with AI-coded segments?
The overall average of the correspondence between manually and AI-coded segments across all interview cases and dimensions of religiosity was 64%. When only the final test for each case in each dimension is considered, the overall average comes to 66.5%. The overall average of correspondence for each individual dimension breaks down as follows: ideology 73.6%*, intellect 50%*, experience 69.3%, consequences of religiosity in everyday life 69.8%, public practice 46% and private practice 72.6%. When only the final test for each case in each dimension is considered, the averages break down as follows: ideology 73.6%*, intellect 50%*, experience 58.3%, consequences of religiosity in everyday life 76.3%, public practice 53.4% and private practice 75.9% (see Figure 3). The overall average of correspondence for each individual case breaks down as follows: case A 78.9%, case B 53.4%, case C 65.9%, case D 67.5%, case E 60.2% and case F 48.6%. When only the final test in each dimension for each case is considered, the averages break down as follows: case A 85.2%, case B 56.7%, case C 63.9%, case D 77%, case E 59.7% and case F 53% (see Figure 4). The comparatively low correspondence in case F is partially attributable to the dimension of experience, where only one code was assigned manually, which AI did not detect, resulting in a correspondence value of 0%. Without this test result, the average correspondence for case F would be 60.8% overall and 70.5% when considering only the final test in each dimension (see adjusted values in Figure 4).
Figure 3. Correspondence rates between AI-generated and manual codes across six dimensions of religiosity. Percentages show the share of manually coded segments matched by AI. Blue bars show results across all 63 cycles of testing; red bars show final tests only. Asterisks indicate dimensions for which only limited AI-coding tests were available (see Limitations section).
Figure 4. Correspondence rates between AI-generated and manual codes across interview cases. Percentages show the share of manually coded segments matched by AI. Blue bars show results across all 63 cycles of testing; red bars show final tests only. The adjusted value for case F (F* (adj.)) accounts for an outlier in the experience dimension, where one missed segment drove the correspondence to 0%.
2. How does the scope of AI-coded segments compare to the scope of the manual codes that overlap with them?
Over the course of 63 cycles of AI-coding, 482 AI-generated segments overlapped with manually coded segments. Out of those AI-coded segments, 88.6% were broader in scope than manual coding, 5.6% identical and 5.8% narrower in scope.
3. What percentage of AI-coded segments are less fine-grained than manual coding (meaning they encompass multiple manually coded segments), and how many manual codes do they contain on average?
Out of the total of 482 AI-generated segments that overlap with manually coded segments, 56.8% of segments are less fine-grained than manual coding, containing multiple manually coded segments. On average, these segments contain 3.2 manually coded segments, with a standard deviation of 2.2.
4. How many segments does AI coding generate that were not manually coded and how likely are such new segments to be valid?
Over the course of 63 cycles of AI-coding a total of 920 text segments were generated, 438 (47.6%) of which were new, meaning they did not correspond to any manually coded segments. This amounts to an average of 7 new segments per round of AI-coding, with a standard deviation of 5. A closer analysis of a random sample of 16 cycles showed that 20.7% of these new segments can reasonably be considered valid novel segments that were missed during manual coding. If this proportion is taken to be representative of the full dataset, it follows that approximately 347 segments (37.7% of all AI-coded segments) neither overlap with manually coded segments nor qualify as valid novel contributions and were therefore erroneously coded.
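The extrapolation in this paragraph can be retraced with a few lines of arithmetic. The sketch below uses only the figures reported above; the variable names are our own, and the calculation rests on the stated assumption that the sampled validity rate is representative of all 63 cycles.

```python
total_ai_segments = 920   # all AI-coded segments across 63 cycles
novel_segments = 438      # segments with no manually coded counterpart
cycles = 63
valid_share = 0.207       # share of sampled novel segments judged valid

# Share of AI-coded segments that are novel.
novel_pct = 100 * novel_segments / total_ai_segments

# Average number of novel segments per cycle.
novel_per_cycle = novel_segments / cycles

# Novel segments expected to be invalid if the sample is representative.
invalid_novel = round(novel_segments * (1 - valid_share))

# Invalid novel segments as a share of all AI-coded segments.
invalid_pct = 100 * invalid_novel / total_ai_segments

print(f"{novel_pct:.1f}% novel, {novel_per_cycle:.1f} per cycle")
print(f"{invalid_novel} invalid novel segments ({invalid_pct:.1f}% of all AI-coded segments)")
```

Because the validity rate comes from a sample of 16 cycles, the resulting count should be read as an estimate rather than an observed total.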
Discussion
Toward a Framework for AI-Assisted Deductive Coding in MAXQDA
Our findings echo the consensus in the broader research community that for AI to be used effectively in research, its implementation requires careful planning and deliberate integration into a well-defined methodological framework. Further, our findings reinforce the broader view that AI should not be understood as a replacement, but as a tool to augment human analysis. In the present study, such a hybrid approach was explored as a proof of concept for the practical integration of AI tools into preexisting workflows for the deductive coding stage of a multistage content analysis in the context of religious studies. While AI-driven technologies may also be applicable at other stages, each potential use must be individually assessed and justified, necessitating further research to explore and validate the broader use of AI in qualitative research in this field.
Our close analysis of both the inputs (coding instructions) and outputs (AI-generated codes) in AI-assisted deductive coding using MAXQDA demonstrated the potential of AI-driven technology to serve a well-defined support function while also underscoring the continued need for human oversight. While exploratory trials revealed that AI cannot reliably execute secondary tasks, such as implementing directives for more fine-grained coding, the first phase of our study showed that the highest complexity coding instructions—when narrowly focused on addressing the central research question—yielded the most consistent correspondence rates between AI and manual coding across our sample. This suggests that, despite limitations in handling secondary tasks, coding instructions should provide focused and comprehensive directives—in this case for coding the various dimensions of religiosity—thereby providing the greatest possible scaffolding to support AI’s interpretation.
Four key figures from the second phase of our study illustrate how AI-generated codes can be leveraged to support manual deductive coding using QDAS. First, 64% of manually coded segments were represented within AI-coded segments (or 66.5% when considering only the final test for each case and dimension). Second, 88.6% of AI-generated segments were broader in scope than the manual coding with which they overlap. Third, 56.8% of AI-generated segments were less fine-grained than the manual coding with which they overlapped, containing multiple manually coded segments. Fourth, 37.7% of AI-generated segments neither overlapped with manual coding nor qualified as valid novel segments.
The first key figure, showing that 64% to 66.5% of manually coded segments were represented within the AI-generated codes, in our view suggests a sufficient level of correspondence for AI outputs to be usefully integrated to support deductive qualitative coding; however, the level of correspondence is not sufficient to replace human coding. Correspondence varied both across different dimensions of religiosity and across different interview cases, indicating that AI’s effectiveness can fluctuate due both to the specific aspects of each dimension and to the idiosyncrasies of each individual interview. Rather than replacing human coding, AI can augment traditional deductive coding by offering a preliminary layer of suggestions, which researchers can then critically evaluate and refine during manual coding, potentially streamlining the coding process without compromising analytical rigor.
The second, third and fourth key figures in our findings indicate what human researchers should prioritize in their evaluation of AI-generated codes and therefore help further define the potential of AI-integrated coding within preexisting workflows for deductive coding in QDAS. The second and third figures show notable differences in segment scope and code granularity between AI and manual coding, clearly indicating that a major task for researchers currently working with AI-generated output in MAXQDA is to break down these broader segments into smaller, codable units. Here the distinction between segment scope and code granularity is useful: our results indicate that post-processing of AI-generated codes by researchers must focus not only on tightening segment boundaries (with 88.6% of AI-coded segments being broader in scope) but also on thematic differentiation (as 56.8% of AI-coded segments are less fine-grained). Both metrics reinforce the point that AI-generated segments should not be accepted wholesale in manual coding; each segment must be critically evaluated, with adjustments made to scope and granularity as needed. The fourth figure shows that 37.7% of AI-generated segments neither overlapped with manual codes nor qualified as valid novel segments. This highlights the need for researchers to be prepared to discard AI-generated codes that do not align with coding guidelines, although, as noted in our Limitations section above, further research is warranted for a more conclusive understanding of the overall validity of such novel AI-coded segments.
The Unique Challenge of Coding Religious Turning-Point Narratives
In closing, it is important to acknowledge the unique challenges that arise when analyzing data related to religious turning points. More than in other domains of experience, the object of religious experience challenges explanatory faculties; religious experience is routinely framed as transcategorical (Hick, 2000) or as beyond “all positive substantial characterization in human thought and language” (Ho, 2006, p. 409). Accounts of such experience are often context-specific, metaphorical and ambiguous in content, requiring cultural sensitivity and interpretative nuance. While religious narratives present additional challenges for both human and AI interpreters, they may present a particular challenge to LLMs, which tend to perform best with consistent, literal language patterns.
Despite significant effort in crafting coding instructions designed to function across denominations and levels of religiosity, our findings show considerable variation in the correspondence between AI and manual coding across both dimensions of religiosity and individual interview cases—even within the relatively homogeneous context of Reformed Christians in Switzerland. For instance, the two traditionally Reformed cases with high CRSi-7 scores (Cases A and D) yielded high correspondence rates of 85.2% and 77%, whereas two medium cases with more alternative orientations (B and C) showed markedly lower correspondence rates of 56.7% and 63.9%. This pattern suggests a tendency for more highly religious and religiously traditional cases to achieve higher correspondence rates. Given the complexity of the narratives in our dataset, it is reasonable to assume that other, less demanding datasets could yield even more stable and higher correspondence rates than those achieved in the present study.
Conclusion
This study offers a structured evaluation of AI-assisted deductive coding within the context of religious studies. As researchers engaged in theory-driven analysis using MAXQDA, we are motivated by a practical interest in how AI-assisted deductive coding might effectively be integrated into established workflows. Our examination of the new automated deductive coding features in MAXQDA showed that highly complex coding instructions, giving focused and comprehensive directives to support AI’s interpretation, yielded the most consistent correspondence rates between AI and manual coding. While AI coding performed sufficiently well in its correspondence to manual coding to serve as a useful reference for human coders, it was less precise in delineating segment scope and granularity and tended to generate a considerable number of invalid novel segments. Overall, these findings support the current consensus in the wider research community that favors a hybrid approach to the use of AI in qualitative research, where human judgement remains the final arbiter both in the segmentation of data and in determining the validity of coded segments. Although AI coding provides a helpful preliminary layer for human coders, a process that effectively integrates AI-assisted coding requires post-processing to tighten segment boundaries, increase thematic differentiation and discard AI-generated codes that do not align with coding guidelines. Due to the often metaphorical nature of religious narratives, this study’s focus on interviews about religious turning points presents an additional challenge, as LLMs tend to perform best with consistent, literal language patterns, potentially leading to weaker results compared to those achievable by the same methods in other domains.
Additionally, due to technical limitations during this study, AI coding could not be fully tested for the densely coded dimensions of ideology and intellect, which restricts the conclusiveness of our findings in these areas. Addressing these challenges in future research—potentially through improved AI capabilities—will be essential to fully realize the potential of AI-assisted coding for the empirical study of religion and similar domains.
Footnotes
Ethical Considerations
We submitted an ethics application to the Ethics Committee of the Canton of Bern. The Ethics Committee determined that the project does not require approval (BASEC number: Req-2022-00065; date of submission: 21/01/2022).
Consent to Participate
Participants initially confirmed their consent during each participation in an SHP survey by entering their login information and clicking to proceed, thereby acknowledging that they had read the information about the survey sent by post and agreed to take part. They also confirmed that they consented to their data being made available for research and teaching purposes. This consent was reiterated in the letter inviting them to participate in the narrative interview. At the beginning of the narrative interview recording, participants were again asked whether they agreed to the recording and scientific analysis of the interview. The interview was only continued if this consent was explicitly given.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Swiss National Science Foundation [grant number 205047].
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
Due to the sensitive and confidential nature of the qualitative interview data collected in this study, the data cannot be shared publicly. However, the anonymized data will be archived and made accessible under restricted conditions through FORS, the Swiss Centre of Expertise in the Social Sciences.
Notes
Appendix
This appendix provides a walkthrough of both traditional (non-AI) and AI-assisted deductive coding in MAXQDA, supplementing the introduction to AI-assisted deductive coding in the section 'Background: AI-Assisted Coding in QDAS'. Although the focus here is on MAXQDA, other QDAS platforms such as ATLAS.ti and NVivo function similarly, especially as it pertains to non-AI coding. The process begins with importing the document to be coded and creating containers that correspond to coding categories, which then store the coded segments assigned to them. The container for each coding category is given a name (e.g., ideology, intellect, experience, etc.) and a unique color-code. Coders highlight text passages and assign them to their relevant categories, transforming them into coded segments; for example, in the coding of an interview, the text segment “religion only leads to war and dogmatism” might be assigned to the container “ideology” that holds all ideologically laden statements made by the interviewee. Once a text segment has been assigned to a particular container, brackets corresponding to the container’s color-code will appear in the left margin of the text document being coded, indicating the presence of a coded segment; when clicked on, these brackets highlight the corresponding coded segments in the same color. Coded segments are grouped in their containers either as codes without a name (i.e., only identified by their color and the container name) or individually named (for example with in vivo code names) as subcodes within the container.
With MAXQDA’s AI Assist, a text of up to 120,000 characters can be coded for any given coding instructions with just a few mouse clicks. Provided with precise coding instructions for any given coding category, AI Assist will suggest a series of coded segments throughout a text and collect them as a new subcode in the container of that coding category. For every AI-coded segment, AI Assist will provide reasoning for its selection in the form of a comment in the right margins of the document being coded. Additionally, all AI-generated content is distinctly color coded with a unique color—both within the container and in the text window—to ensure the distinguishability of AI-coded segments from manually-coded segments.
Once a document has been fully coded, MAXQDA allows for easy shifting of subcodes within a given container, facilitating the development of sub- and superordinate categories—a process that can support axial coding by helping to organize and relate categories more clearly. Much more can be said about the functionality of QDAS at every level of qualitative analysis, but, as our focus here is circumscribed to deductive coding, a comprehensive overview lies beyond the scope of this paper.
