Contaminated corpora: How the retraction crisis is being silently encoded into AI scientific knowledge

Abstract

This paper argues that the retraction crisis in scholarly publishing is increasingly entering large language model (LLM) training datasets through commercial publisher–AI licensing agreements that supply entire journal archives without retraction filtering. Using a conceptual-analytical approach, the paper synthesizes three strands of empirical literature: studies of retraction growth and post-retraction citation, research on how LLMs handle retracted papers, and documented publisher–AI licensing agreements. The paper advances three theoretical findings. First, it identifies a direct and commercially formalised pipeline through which retracted papers enter LLM training corpora: bulk licensing agreements in which publishers sell entire journal archives to AI developers without retraction-filtered exports, thereby bypassing the retraction notification infrastructure that already exists and functions for human readers. Second, it introduces the contamination-absence asymmetry: citation decay removes valid evidence from AI knowledge while retraction propagation inserts invalid evidence into it, producing a compound failure mode in which AI systems simultaneously know less than they should and believe more than they should. Third, it proposes the governance gap hypothesis: retraction infrastructure (Retraction Watch, Crossref retraction flags, PubMed retraction notices) is a functioning information system that stops at the human reader and was never designed to intercept AI ingestion. This is a gap in information governance, and closing it is a library and information science responsibility. The study concludes that ensuring the integrity of AI-generated scientific knowledge requires stronger information governance and the active involvement of library and information science professionals in the oversight of AI training datasets.
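
To make the idea of a retraction-filtered export concrete, the following is a minimal sketch and not a description of any publisher's or AI developer's actual pipeline. It assumes that each archive record is a dictionary with a `doi` field and that a list of retracted DOIs is available from a source such as the retraction databases named above. The function names, the record layout, and the example DOIs are all hypothetical.

```python
from typing import Iterable, Iterator, Optional

# Common ways a DOI is written; stripped so that variants compare equal.
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip a leading resolver URL or 'doi:' prefix."""
    doi = doi.strip().lower()
    for prefix in _DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def filter_retracted(
    records: Iterable[dict], retracted_dois: Iterable[str]
) -> Iterator[dict]:
    """Yield only the records whose DOI is not in the retracted set.

    Assumption: records without a DOI cannot be checked and are passed
    through unchanged; a stricter pipeline might quarantine them instead.
    """
    retracted = {normalize_doi(d) for d in retracted_dois}
    for record in records:
        doi: Optional[str] = record.get("doi")
        if doi and normalize_doi(doi) in retracted:
            continue  # drop the retracted paper from the export
        yield record


# Hypothetical archive and retraction list, for illustration only.
archive = [
    {"doi": "10.1234/example.001", "title": "Valid study"},
    {"doi": "https://doi.org/10.1234/EXAMPLE.002", "title": "Retracted study"},
    {"doi": None, "title": "No DOI"},
]
retracted_list = ["10.1234/example.002"]

kept = list(filter_retracted(archive, retracted_list))
print([r["title"] for r in kept])
```

Matching on normalized DOIs is only one possible design. It would miss retracted items that lack a DOI, and it does not handle retractions issued after the export date, which would require re-checking the corpus periodically.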

Keywords

contaminated corpora; retraction crisis; large language models; AI training data; information governance; scientific integrity; publisher-AI licensing
