Abstract
How has the field of virtual reality (VR) evolved and what type of research has made an impact? We used natural language processing techniques and generative artificial intelligence to develop the most complete review of experimental social science VR research to date (1992–2024). From a collection of 21,195 abstracts written by 52,543 unique authors, 13 reliable themes emerged over time, with immersive experiences receiving the most recent attention. Interdisciplinary teams were cited more than less interdisciplinary teams, and watershed moments like mainstream industry embracing VR (i.e., Google Cardboard’s release) correlated with changes in scholars’ research focus. Based on such available data, we observed that more than half of all articles over the past 30 years have been published in the last 6 years. Our database—the VRbalARchive—is publicly available, helping scholars investigate VR’s history and enhancing our theoretical understanding of the medium.
Introduction
Virtual reality (VR), and similar technologies like augmented reality (AR) or mixed reality (MR), are often hard to define. Indeed, Jaron Lanier, the person who coined the term, offered 52 different definitions of VR in his recent memoir on the topic, 1 ranging from an “art form” to a “mirror image” to a “sensorimotor loop” to a “magic trick.” The proliferation of jargon, synonyms, or abstractions in the field reached such a crisis point that Mel Slater, a pioneering VR researcher, made a public call to other scholars in 2007 to move away from expanding terms and definitions, and to focus on substantive impact and findings. Nearly two decades ago, he declared, “The days of endless debate about the ‘true meaning’ are over.” 2 Despite the strength of his argument, jargon and abstractions persist, and depending on the group of scholars or the journal publishing the work, readers can encounter dozens of terms that are meant to convey the same basic concept or a different one altogether (e.g., the metaverse and VR are often conflated). 3 Indeed, five commonly used labels for the medium share almost no words: extended reality, immersive virtual environments, spatial computing, modeling and simulation, and Cave Automatic Virtual Environments (CAVEs).
Word choice aside, the broader field also features subareas that are qualitatively different in terms of technological features and the user experience. For example, some systems completely eliminate light and sound from the physical world, while others augment real-time video or actual light from the world with digital content. Milgram and Kishino 4 provide an early discussion of this spectrum of virtuality, and recent scholars have further explicated these concepts.5,6 Despite these real technological differences, when one returns to Sutherland’s 7 original description of this medium, all the various implementations and user studies focusing on VR, AR, and MR seem to fit the general description of “a looking glass into a mathematical wonderland.” When one focuses on applications of the technology, the details of the content of the simulation are often more important than the particular technological implementations. In this paper, we use the term VR, which is most common both in communication research and in the vernacular, but our research scope extends across the entire spectrum of virtuality.
The complexity of VR as a medium, and the terminology behind it, has consequences for scholars who seek to understand the field and summarize the technology. For example, the second author of this paper has dedicated much of his career to studying VR, but if one does a comprehensive search of abstracts using only the words “virtual reality,” over 80 percent of his research articles are missed. Indeed, systematic reviews are bounded by the keywords that underlie them, 8 and given the lack of overlap regarding VR terms that tend to be similar, new approaches are needed to be more inclusive in the evaluation of academic fields.
The current work seeks to remedy this and to provide a substantive account of the field of VR over time. We have four high-level goals in this paper. First, we build and validate the most comprehensive dataset of experimental VR (and similar technologies like AR and MR) articles to date. Second, we parse the field semantically to show how research on the medium has changed over time. Third, we evaluate the research characteristics that have led to some measures of impact, such as citation counts and authorship attributes (e.g., the type of institutions that authors come from, plus interdisciplinarity). Finally, we make the dataset available to other scholars and suggest new avenues for future research to better understand the medium. We accomplish these goals by computationally extracting themes in the literature and we use article-level metadata to draw inferences about the field’s composition.
Historicizing Academic Fields
The most common way to examine an academic field is through self-reflection facilitated by a literature review based on knowledge of the authors and search techniques. This is often targeted for a particular empirical task or set of relationships. When a field matures, however, articles often attempt to draw broader conclusions about its status.9,10 In search of such field-level trends, it is practically impossible to discover every study, over all time periods, across fields, and across journals for a specific interest. Recent advances in computational social science have made important contributions helping to alleviate some of the prior constraints related to the size (e.g., literature may be quite large), scale (e.g., literature may expand beyond one field), and speed of academic publishing (e.g., thousands of new articles are added to PubMed each day). 11
Indeed, articles can now present inclusive accounts of a field or subfield. Take, for example, a recent paper evaluating trends in nearly 100 years of communication science scholarship. 12 The authors obtained paper abstracts from 22 journals (n = 20,664 articles), modeling the top 50 themes in communication over time and tracking how those themes evolved. Other reviews have performed similar analytic processes, 13 demonstrating that perhaps the best way to understand a field is by evaluating it holistically. Until recently, this task has been a difficult and resource-intensive endeavor.
Against this backdrop, we used natural language processing techniques and large language models (LLMs) to evaluate thousands of articles for VR content in experimental studies. We seek to learn how the field has changed thematically over time, how it has responded to events in industry, and whether major social and organizational artifacts are associated with its evolution. We propose the following exploratory research questions:
RQ1: What are the dominant themes in VR research over time?
RQ2: What social and organizational artifacts have contributed to VR’s impact?
Method
To evaluate trends in experimental VR scholarship, we used the OpenAlex database, 14 which catalogs over 240 million works across disciplines and which we accessed via the openalexR package in R. 15 We queried OpenAlex as follows. We selected a start date (1/1/1992) and an end date (7/1/2024) for data extraction to cover the long history of VR and VR-related concepts for a historical review. We then retrieved all results with at least one of the following terms in the full text, title, or abstract of each paper: “virtual reality,” “VR,” “immersive virtual environment,” “IVE,” “immersive virtual environment technology,” “IVET,” “augmented reality,” “HMD,” “head-mounted display,” “head mounted display,” “mixed reality,” “extended reality,” or “XR.” We excluded articles that contained the terms “desktop virtual reality,” “desktop VR,” or “Second Life” to avoid desktop-based applications of VR; hence, we focus on headset-based VR. “AR” was not used to avoid confusion with an unrelated concept (a rifle) and “MR” was not used to avoid confusion with titles (e.g., Mr. Smith). This search collected a total of 915,328 publications with abstracts, which we then filtered for a more rigorous analysis.
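The inclusion and exclusion logic described above can be illustrated with a short Python sketch. It is not the authors' pipeline, which queried OpenAlex through openalexR; the function name `passes_keyword_screen` and the matching rules (whole-word matching, with all-capital acronyms matched case-sensitively) are assumptions made for this example only.

```python
import re

# Inclusion and exclusion terms copied from the search description above.
INCLUDE_TERMS = [
    "virtual reality", "VR", "immersive virtual environment", "IVE",
    "immersive virtual environment technology", "IVET", "augmented reality",
    "HMD", "head-mounted display", "head mounted display", "mixed reality",
    "extended reality", "XR",
]
EXCLUDE_TERMS = ["desktop virtual reality", "desktop VR", "Second Life"]


def _pattern(term):
    # Assumption: whole-word matching. All-capital acronyms are matched
    # case-sensitively; every other term is matched case-insensitively.
    flags = 0 if term.isupper() else re.IGNORECASE
    return re.compile(r"\b" + re.escape(term) + r"\b", flags)


INCLUDE_PATTERNS = [_pattern(t) for t in INCLUDE_TERMS]
EXCLUDE_PATTERNS = [_pattern(t) for t in EXCLUDE_TERMS]


def passes_keyword_screen(text):
    """Return True if text has an inclusion term and no exclusion term."""
    has_include = any(p.search(text) for p in INCLUDE_PATTERNS)
    has_exclude = any(p.search(text) for p in EXCLUDE_PATTERNS)
    return has_include and not has_exclude
```

Under these assumptions, a sentence mentioning a head-mounted display passes the screen, whereas a sentence mentioning "desktop VR" is rejected even though it also contains "VR".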
Because an article might mention VR without focusing on it, we had LLMs identify whether each of the 915,328 articles was topically about VR or related technologies. After excluding abstracts with non-English writing, we used two OpenAI models, GPT-4 (gpt-4-turbo) and GPT-4o (gpt-4o), to code each abstract. We used the GPT-4 models because of their advanced reasoning skills relative to other LLMs. 16 The prompt provided to each LLM (see Supplementary Data S1) was developed by consulting seminal texts on VR,17–20 and by asking Anthropic’s Claude AI (3.5 Sonnet) to refine it.
The two LLMs coded all 915,328 cases and achieved substantial agreement (Cohen’s κ = 0.923; see Supplementary Data for additional validity checks). Discrepancies were resolved by gpt-4o-mini, which was given the same coding task; at least two of the three models had to agree for an article to be retained. This created a subset of 196,734 articles, which we then screened for an experimental focus. We accomplished this by creating a verbal dictionary of 19 experimental terms (see Supplementary Data S1) as a “first pass” to isolate articles with experimental terms (n = 45,179). Then, gpt-4o coded these 45,179 abstracts using a second prompt (see Supplementary Data S1) to ensure an experimental focus. A total of 21,195 abstracts comprised the final dataset (1992–2024), which we formally name the VRbalARchive: https://osf.io/j2uw8/. Abstracts were submitted to the Meaning Extraction Method,21,22 a topic modeling approach that formed themes for further evaluation (see Supplementary Data for details of the method and input criteria).
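To make the agreement statistic and the tie-breaking rule concrete, the following Python sketch computes Cohen's κ for two coders and resolves a disagreement with a third code. The function names and the toy labels are hypothetical, and the sketch assumes binary codes (True meaning "about VR"); it is an illustration of the general technique, not the authors' code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both coders must label the same non-empty item set.")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the coders' marginal label proportions.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


def resolve_code(code_a, code_b, tiebreak_code=None):
    """Return the majority code; a third code is needed only on disagreement."""
    if code_a == code_b:
        return code_a
    if tiebreak_code is None:
        raise ValueError("Coders disagree; a tiebreak code is required.")
    # With binary codes, the third coder necessarily sides with one of the two.
    return tiebreak_code
```

For example, `cohens_kappa([1, 1, 0, 0], [1, 1, 0, 1])` gives 0.5: the coders agree on 75 percent of items, and 50 percent agreement would be expected by chance.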
Results
The 21,195 abstracts in our database were written by 86,191 total authors (52,543 unique names) from 6,085 unique institutions and were published in 3,702 unique outlets. The most frequent institutions and publication outlets are listed in Supplementary Data S1.
Themes from meaning extraction
Addressing RQ1, there were 13 reliable themes extracted from the corpus, which after discussion between the two authors, we categorized as: (1) educational VR, (2) randomized controlled trials (RCTs), (3) experimental terms, (4) AR, (5) head-mounted displays, (6) movement, (7) task performance, (8) statistical significance, (9) hypotheses, (10) paper sections, (11) training, (12) immersive experiences, and (13) experimental effects. Words underlying each theme are listed in Table 1 with component loadings in Supplementary Data S1.
Table 1. Themes from the Meaning Extraction Method
Theme names appear in the first row of verbal descriptors, with the terms reflecting each theme below them. Corresponding component loadings are reported in Supplementary Data S1 because of space considerations.
Descriptive trends of such results indeed mark the history of experimental VR scholarship over time (Figure 1). For example, there has been an increased proportional focus on educational VR and AR, but less of a focus on task performance and HMDs. All bivariate correlations between themes and publication year were statistically significant at the 5 percent level except for movement (p = 0.526) and training (p = 0.118), with the largest correlation between publication year and immersive experiences (r = 0.145, p < 0.001).

Figure 1. Themes over time across experimental virtual reality (VR) and augmented reality (AR) abstracts. The right vertical bar represents June 25, 2014, the date VR began to go mainstream in big tech, based on the release of Google Cardboard and the purchase of Oculus by Facebook. Trend lines were estimated using LOESS models. Numbers in parentheses on the x-axis are the cumulative number of articles published in experimental VR over time.
What contributes to VR’s impact?
Citation count
To address RQ2, we ran partial nonparametric correlations between each theme and citation count, controlling for publication year and Bonferroni-correcting the significance values by the number of statistical tests performed (n = 13). To highlight a few trends: articles that feature randomized controlled trials (ρ = 0.022, p = 0.016), task performance (ρ = 0.021, p = 0.030), immersive experiences (ρ = 0.047, p < 0.001), and experimental effects (ρ = 0.072, p < 0.001) tend to be cited more than articles that focus less on these themes.
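A partial nonparametric correlation of this kind can be sketched in Python as follows. The source does not state which estimator was used, so this example assumes one standard approach: rank-transform all three variables, then apply the first-order partial correlation formula to the rank correlations. The function names are hypothetical, and the Bonferroni step simply multiplies a p value by the number of tests (13 here, as stated above).

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_spearman(x, y, z):
    """Spearman correlation of x and y, controlling for z (e.g., year)."""
    rx, ry, rz = average_ranks(x), average_ranks(y), average_ranks(z)
    r_xy, r_xz, r_yz = pearson(rx, ry), pearson(rx, rz), pearson(ry, rz)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def bonferroni(p_value, n_tests=13):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p_value * n_tests)
```

In this setting, `x` would hold an article's score on one theme, `y` its citation count, and `z` its publication year. The sketch does not compute p values and does not guard against a control variable that is perfectly rank-correlated with `x` or `y`.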
Authorship attributes and effects
Articles had an average of 4.07 authors per publication (SD = 2.32; Mdn = 4.00), and Education was the most common institution type in the database (n = 57,652 authors). We examined how authorship interdisciplinarity was associated with extracted themes and citation count. To create an interdisciplinarity score for each paper, we converted the number of authors per paper from each of the eight institution types (e.g., Health care, Education) to binary scores (e.g., a paper had an author from Health care or not; this was repeated for all institution types). We summed scores across all categories, and average interdisciplinarity was 1.23 (SD = 0.48; Mdn = 1; min = 1, max = 5). Cases were excluded if articles did not have an author from at least one category. There was no significant relationship between interdisciplinarity and year (r = −0.012, p = 0.103).
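The interdisciplinarity score described above reduces to counting how many institution types are represented among a paper's authors. A minimal Python sketch follows; the function name is hypothetical, and of the category labels used in the example only "Education" and "Health care" come from the text ("Company" is a placeholder).

```python
def interdisciplinarity_score(author_counts_by_type):
    """Count institution types with at least one author on the paper.

    author_counts_by_type maps an institution type to the number of the
    paper's authors affiliated with that type. Returns None when no type
    is represented, mirroring the exclusion rule described in the text.
    """
    # Binarize each count (any author from the type = 1), then sum.
    score = sum(1 for count in author_counts_by_type.values() if count > 0)
    return score if score > 0 else None


# Hypothetical paper: three Education authors and one Health care author.
example = {"Education": 3, "Health care": 1, "Company": 0}
example_score = interdisciplinarity_score(example)  # two types represented
```

Because the counts are binarized before summing, a paper with ten authors from one institution type scores lower than a paper with two authors from two different types.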
After controlling for paper year in a partial correlation and Bonferroni-correcting the significance values, we observed that articles publishing on educational VR (r = −0.088, p < 0.001) and immersive experiences (r = −0.033, p < 0.001) tended to have less interdisciplinary teams. Articles publishing randomized controlled trials (r = 0.152, p < 0.001) and on training (r = 0.052, p < 0.001) had more interdisciplinary teams. Finally, controlling for year, more interdisciplinary teams have had their articles cited more than less interdisciplinary teams (ρ = 0.088, p < 0.001).
Does authorship identity relate to the type of VR research being conducted? We used the gender package in R 23 to approximate authors’ gender from their first names. We calculated the proportion of women and men relative to the total number of authors per paper (though, for nearly 34 percent of authors, no determination could be made). The proportion of men was higher than that of women, but more recent scholarship featured a slight increase in the proportion of women (see Supplementary Data S1). This finding extends previous analyses of authorship gender from the IEEE VR community to the large body of experimental social science research in VR. 24
Emergence of consumer VR
We also evaluated how one critical moment in VR’s history, the mainstreaming of lightweight HMDs like Google Cardboard (on June 25, 2014) and the Oculus Rift, which emerged in roughly the same time period, might have been associated with changes in VR’s academic history. Articles published after Google Cardboard’s release tended to focus more on immersive experiences (Cohen’s d = 0.324), AR (Cohen’s d = 0.206), and educational VR (Cohen’s d = 0.268) than articles published before its release. Articles published after Google Cardboard’s release also focused less on task performance (Cohen’s d = 0.092) than articles published before.
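For readers less familiar with the effect size reported here, the following Python sketch computes Cohen's d for two groups using the pooled standard deviation. The source does not specify which variant of d was used, so the pooled-variance form is an assumption, and the function name and toy values are hypothetical.

```python
import math


def cohens_d(group_a, group_b):
    """Cohen's d (pooled SD) for the mean difference group_a minus group_b."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    # Sample variances with the n - 1 denominator.
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (n_b - 1)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (mean_a - mean_b) / pooled_sd


# Toy theme scores for articles published after and before a cutoff date.
after_release = [2.0, 4.0, 6.0]
before_release = [1.0, 3.0, 5.0]
d = cohens_d(after_release, before_release)
```

In this toy example the group means differ by 1 and the pooled standard deviation is 2, so d is 0.5. The sign of d depends on the order in which the groups are passed.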
Trends in social science journals
Our final exploratory analysis examined how experimental VR, and its respective themes, have been represented in social science journals at large. We drew on prior work that has identified major journals from the field,13,25 and supplemented this list with social science journals indexed by SCImago. Our database contained a total of 303 social science articles. The journal Cyberpsychology, Behavior, and Social Networking had the most articles in this archive, accounting for roughly 20 percent of all social science articles (n = 62; see Supplementary Data S1; this number nearly doubles if articles published under the journal’s prior name, CyberPsychology & Behavior, are included).
Discussion
This paper provided a comprehensive analysis of VR research in over 30 years of academic publishing, offering insights into the evolution and current state of the field. We identified 13 distinct themes in experimental VR scholarship and traced their development. Our findings revealed significant shifts in research foci and authorship attributes over time.
There are several results from our work that deserve additional discussion. VR scholars have long suspected that interdisciplinary teams are critical. For example, to run a typical social science study, one historically needed to have a dedicated engineer on the team—the hardware and software issues were daunting, and high-impact social science work often required expertise in both engineering and psychology. Similarly, social scientists often engage with clinicians because many important research questions are based on emotion and well-being. Working with clinicians is a sound strategy given how intense and engaging the medium can be. Finally, good science often relies on opportunistic moments; social scientists will often work with industry to access the latest VR hardware. Given these unique aspects of the medium, it is unsurprising to see how interdisciplinary teams have a high impact.
Regarding the debate over terms (e.g., VR, AR), the current paper offers an alternative framing. While two of our categories are intrinsic to hardware—“augmented reality” and “head-mounted display”—most of them focus more on the research itself including the domain studied such as “education,” the experimental methodology such as “randomized controlled trials,” and sometimes on possible mechanisms, such as “movement.” In this sense, we provide a framework that moves past the surface-level features of the technology and shows underlying patterns about the research itself.
For those who have studied VR for decades, the years around 2014 felt like a paradigm shift as large technology companies released products (i.e., Google and Facebook) or covertly dedicated huge amounts of resources to the medium (i.e., Apple). Suddenly, VR became a household name, and hardware became cheap(er) and plentiful. It is fascinating to trace this moment in time, empirically, with inflection points in published research. The research trends seem to have adapted, with more work on educational applications that were the canonical use of Cardboard. Similarly, in 2014, the AR company Magic Leap received over half a billion dollars in funding from mainstream technology investors, and the research trends dovetail with this event, with research in AR increasing after 2014. Of course, there is no causal evidence for the direction of these links, but it is notable that the trends we map in academia also converge with landmark technological events.
Considerations and lessons learned
This study demonstrates the utility of applying large-scale, computational approaches to facilitate literature reviews in rapidly evolving technological fields. Given the effectiveness of this approach, future systematic reviews or meta-analyses may benefit from a computational literature search and thematic extraction using the Meaning Extraction Method or similar techniques. That said, a nontrivial amount of expertise in computational methods, verbal behavior, and natural language processing was needed to create the VRbalARchive, not to mention the resource cost associated with the LLM coding tasks. Approximately 144 hours of computing time were spent coding nearly 2 million abstracts and resolving discrepancies (starting at step one of our analytic process). Assuming two coders could each read 30 abstracts per hour, humans would take over 61,000 consecutive hours to complete the coding (nearly 7 years of time). This is clearly infeasible for human coders, and it offers a provocative starting point for discussing the challenges and opportunities of using LLMs to code literature.
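The human-coding estimate can be reproduced with a few lines of arithmetic. The Python sketch below shows one reading that is consistent with the reported figures: each of the 915,328 abstracts is coded twice (once per coder), and the resulting codings are worked through at 30 per hour without breaks. That reading, and the variable names, are assumptions on our part rather than a calculation documented in the text.

```python
ABSTRACTS = 915_328        # abstracts entering the first coding step
CODERS = 2                 # each abstract is coded by both coders
ABSTRACTS_PER_HOUR = 30    # assumed reading speed stated in the text
HOURS_PER_YEAR = 24 * 365  # "consecutive" hours, i.e., no breaks

# Total coding decisions: nearly 2 million.
total_codings = ABSTRACTS * CODERS

# Hours needed at 30 codings per hour: just over 61,000.
human_hours = total_codings / ABSTRACTS_PER_HOUR

# Expressed in years of uninterrupted work: nearly 7.
human_years = human_hours / HOURS_PER_YEAR

print(f"{total_codings:,} codings, {human_hours:,.0f} hours, "
      f"{human_years:.2f} years")
```

By comparison, the approximately 144 hours of computing time reported above is smaller than the human estimate by a factor of more than 400.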
Limitations and future directions
In our query of OpenAlex, articles must have contained a VR-related term and not a desktop VR term. It is possible that some studies were missed if such media were being compared in a single study, and future work should consider evaluating how trends in desktop VR develop over time as well. Regarding AI coding, some work notes that reproducibility may be an issue, 26 and we encourage future scholars to test whether this concern extends to the coding of academic articles. We also present only correlational evidence in this paper, based on English abstracts. Future work may attempt to use more sophisticated analytic techniques like cross-lagged panel models to consider the causal direction of the variables in our work and use non-English texts for additional representation. Evaluating the thematic differences with full texts instead of abstracts should be considered, despite evidence that suggests abstracts and full texts have similar content and style.27,28 Finally, it is unclear whether the patterns we observed for VR are unique to this medium or are also present in other media.
In the future, researchers may also seek to track the impact of landmark articles on VR publishing, investigating how one paper “set the tone” for a new subfield or trajectory of research. Another opportunity would be to investigate how individuals or labs have impacted VR’s history. We purposefully did not engage in this type of work to avoid singling out individuals for praise or blame, but this might produce helpful findings if done thoughtfully and carefully. Finally, using the archive to forecast the future of VR in parallel with other media innovations might be an instructive exercise for those in academia and industry. For example, we believe that generative AI may impact VR by developing bespoke narratives and immersive experiences with minimal human involvement (e.g., having generative models create new worlds for people based on their prior behavior, like adaptive testing), but it is unclear how this customization might impact laboratory research that prizes experimental control.
Footnotes
Authors’ Contributions
D.M.M.: Conceptualization, methodology, data curation, writing—original draft, visualization, investigation, writing—reviewing and editing. J.N.B.: Conceptualization, writing—original draft, writing—reviewing and editing.
Author Disclosure Statement
The authors declare they have no conflicts of interest to disclose.
Funding Information
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
References
