Abstract
This paper argues that the retraction crisis in scholarly publishing is increasingly entering large language model (LLM) training datasets through commercial publisher–AI licensing agreements that provide entire journal archives without retraction filtering. Using a conceptual-analytical approach, the paper synthesizes three strands of empirical literature: studies on retraction growth and post-retraction citation, research examining LLM interactions with retracted papers, and documented publisher–AI licensing agreements. The paper advances three theoretical findings. First, it identifies a direct and commercially formalised pipeline through which retracted papers enter LLM training corpora: bulk licensing agreements in which publishers sell entire journal archives to AI developers without retraction-filtered exports, thereby bypassing the retraction notification infrastructure that already exists and functions for human readers. Second, it introduces the contamination-absence asymmetry: citation decay removes valid evidence from AI knowledge while retraction propagation inserts invalid evidence into it, producing a compound failure mode in which AI systems simultaneously know less than they should and believe more than they should. Third, it proposes the governance gap hypothesis: that retraction infrastructure (Retraction Watch, Crossref retraction flags, PubMed retraction notices) constitutes a functioning information system that stops at the human reader and was never designed to intercept AI ingestion, leaving a gap in information governance whose closure is a library and information science responsibility. The paper concludes that ensuring the integrity of AI-generated scientific knowledge requires stronger information governance and the active involvement of library and information science professionals in the oversight of AI training datasets.
