AI-Driven Detection of Industry-Specific Financial Fraud: A Theory-Informed NLP Approach

Abstract

This study uses natural language processing and machine learning to classify financial fraud based on the SEC Accounting and Auditing Enforcement Releases disclosures. It introduces a hybrid framework combining Latent Dirichlet Allocation topic modeling and supervised learning to identify five fraud categories: financial misstatements, bribery, tax fraud, investment fraud, and related-party transactions. These are mapped to industry sectors to reveal contextual vulnerabilities. Logistic regression delivers stable, interpretable results, while deep learning models capture procedural fraud patterns. Chi-square tests and clustering confirm significant industry–fraud links, highlighting the need for sector-specific detection. By aligning outcomes with criminological theories, the study contributes to fraud analytics and regulatory enforcement. The findings support forensic accountants and analysts in building targeted, industry-aware fraud detection systems.

Keywords

white-collar crime, fraud, machine learning, routine activity theory, natural language processing

Introduction

Fraud encompasses diverse forms of financial misconduct, including bribery, financial misstatements, tax violations, related-party transactions, and investment fraud. Despite these differences in mechanisms, actors, and severity, most fraud detection research and regulatory tools treat fraud as a single, undifferentiated category (Bao et al., 2020; Cecchini et al., 2010; M. Lokanan & Sharma, 2024). Such generalization obscures industry-specific risk patterns and limits the ability to understand how distinct forms of misconduct emerge across organizational and regulatory contexts.

The core problem addressed in this study is the lack of industry-aware and theory-informed fraud classification. Existing approaches rarely distinguish fraud types or examine how misconduct varies across sectors, despite clear differences in opportunity structures, governance mechanisms, and regulatory oversight. As a result, research to date has mostly focused on applying machine learning to fraud prediction, with comparatively limited engagement with criminological theory. The present study draws on Routine Activity Theory (RAT), which suggests that fraud should vary systematically across industries due to differences in motivated actors, suitable targets, and guardianship conditions. Empirical approaches capable of capturing these patterns, however, remain underdeveloped.

Against this backdrop, the present study examines whether fraud typologies exhibit systematic variation across industry sectors and whether such patterns can be identified through natural language processing (NLP) applied to regulatory disclosures. Specifically, the study asks: Do distinct fraud typologies exhibit systematic associations with specific industry sectors, and can these relationships be identified using NLP applied to regulatory disclosures? To address this question, an NLP-based framework is developed using textual disclosures from the U.S. Securities and Exchange Commission’s (SEC) Accounting and Auditing Enforcement Releases (AAERs). Latent Dirichlet Allocation (LDA) and supervised classification are used to identify fraud types, while statistical testing and clustering techniques examine their distribution across industries. Analytical emphasis is placed on interpretability, ensuring that identified patterns can be linked to underlying opportunity structures rather than treated as outputs of a “black box” model.

The study contributes to the white-collar crime scholarship in three ways. First, the classification of fraud by type and industry enables more targeted identification of sector-specific risks, supporting improved regulatory and enforcement strategies. Second, the integration of RAT with NLP-based analysis advances a theory-driven approach to fraud detection, linking observed patterns to variations in opportunity and guardianship across industries. Third, mapping fraud typologies directly to industry contexts extends existing research by moving beyond binary fraud detection toward a structured understanding of how different forms of misconduct cluster within sectors.

The paper is organized into seven sections. The literature review outlines prior research on fraud detection in accounting and white-collar criminology, highlighting gaps in applying machine learning and NLP to financial misconduct. The theoretical framework introduces RAT and CSA as lenses for understanding structural and procedural aspects of fraud. The methods section details data collection, preprocessing, and model development using topic modelling, supervised learning, and deep learning. Results present model performance and sectoral patterns identified through chi-square tests and network analysis. The discussion interprets these findings within criminological and institutional contexts, and the conclusion summarizes key contributions and directions for future research.

Related Literature

Conceptualizing Fraud: Disciplinary Definitions and Contextual Boundaries

Fraud is a relative and discipline-specific concept, defined differently across criminology and accounting and finance. In criminology, fraud—often framed within the broader category of white-collar crime—is understood as a non-violent act of deception committed for financial or personal gain through the abuse of trust, manipulation, or concealment (Shapiro, 1990, pp. 347–348; Sutherland, 1949, p. 9). Accounting research, by contrast, defines fraud as the intentional misrepresentation or omission of financial information designed to mislead stakeholders and distort financial outcomes (Cooper et al., 2013, pp. 440–441; M. E. Lokanan, 2015, p. 203; Matthews, 2005, p. 520). While both disciplines converge on the central role of deceit and financial advantage, criminology emphasizes systemic violations of trust and the social mechanisms enabling deception, whereas accounting focuses on fraudulent reporting practices within organizational and regulatory contexts.

Understanding fraud as an abuse of trust rather than a fixed behavioral trait allows for its classification into more specific types (such as bribery, investment fraud, related-party transactions, misrepresentation, and tax fraud), each shaped by distinct motivations, actors, and industry vulnerabilities (Shapiro, 1990; Sutherland, 1949). Treating these forms as interchangeable under a single definition obscures key operational distinctions and reduces the precision of detection models. The lack of differentiation among fraud types across industries represents a fundamental shortcoming in both academic research and applied detection frameworks. Machine learning systems trained on uniform definitions of fraud risk oversimplifying its structural and contextual diversity, leading to lower detection accuracy and limited cross-sector adaptability. Integrating industry context into fraud detection frameworks is therefore both theoretically essential and operationally necessary to build more targeted, risk-sensitive, and context-aware detection systems. Criminological perspectives further suggest that such variation is not random but shaped by differences in opportunity structures across industries. Identifying these patterns requires approaches capable of capturing context-dependent signals within unstructured disclosures, motivating the integration of RAT with interpretable analytical methods.

Machine Learning in Fraud Detection

Advances in artificial intelligence and machine learning have reshaped fraud detection research in accounting and finance. Most studies rely on predictive models using structured data such as financial statements, stock prices, and firm characteristics (Bao et al., 2020; Bertomeu et al., 2021; Cecchini et al., 2010). Supervised algorithms—including linear, probabilistic, ensemble, and neural network models—demonstrate strong predictive accuracy, particularly in capturing non-linear relationships in financial data (Bao et al., 2020; Wu et al., 2025). Despite these advances, fraud is typically treated as a binary construct—fraudulent or not—without accounting for variation in type or industry context (Bao et al., 2020; M. Lokanan & Sharma, 2024). Such approaches overlook distinctions among financial misstatements, bribery, tax evasion, and related-party transactions, each shaped by different actors, incentives, and concealment strategies (Coleman, 1987). Collapsing these forms into a single category reduces analytical precision and limits the practical value of detection systems (Davis & Pesch, 2013; Power, 2013).

Interpretability remains a central limitation. Many models operate as black-box systems, restricting their usefulness for auditors and regulators who require transparent, explainable outputs linked to observable behaviors. As a result, improved fraud detection depends not only on predictive performance but also on models that are interpretable and sensitive to contextual variation. Recent developments in NLP extend fraud detection to unstructured textual data, including corporate disclosures and regulatory filings (Bochkay et al., 2023; Feuerriegel & Pröllochs, 2021). Topic modelling, particularly LDA, has been used to identify latent themes related to managerial tone and disclosure patterns. However, applications remain limited in scope and rarely differentiate fraud types or examine how misconduct varies across industries (Cheng & Cai, 2023; Wang & Xu, 2018). The prevailing focus on generalized indicators of deception therefore overlooks the structural and sectoral dimensions of fraud.

Existing approaches therefore lack a framework for explaining why different forms of fraud emerge across industries and organizational contexts. Criminological theories, particularly RAT, suggest that such variation is driven by differences in opportunity structures, including the configuration of actors, targets, and guardianship. However, empirical methods capable of systematically identifying these patterns from unstructured disclosures remain limited. Addressing this gap requires approaches that can extract interpretable, context-sensitive signals from textual data, motivating the use of NLP techniques in the present study.

Theoretical Foundations: Routine Activity Theory

Very few studies that apply machine learning to fraud detection have engaged explicitly with theory to explain the contextual factors (such as industry structure, regulatory oversight, and sector-specific risks) that shape how misconduct emerges (M. Lokanan & Sharma, 2024; Wu et al., 2025). Most research focuses on refining model performance or algorithmic accuracy without accounting for how distinct forms of fraud arise under different organizational and institutional conditions (Bao et al., 2020; Cecchini et al., 2010; Perols, 2011). Conceptual distinctions among fraud types remain underdeveloped, forcing detection systems into reductive frameworks that treat all misconduct as uniform. The present study integrates criminological theory, specifically RAT, to ground AI-based models in explanations that link fraud typologies to their industry contexts.

RAT provides a structural framework for analyzing conditions that enable corporate fraud. Crime occurs when a motivated offender, a suitable target, and the absence of capable guardianship converge in time and space (Cohen & Felson, 2015; Pratt et al., 2010). Originally developed to explain conventional crimes, RAT has since been applied to white-collar and cyber offenses (Leukfeldt & Yar, 2016; Natarajan, 2016). RAT’s emphasis on opportunity structures, which are shaped by organizational and environmental conditions, offers insight into why certain fraud types cluster within specific industries (Kleemans et al., 2012; Paternoster & Simpson, 1993; Williams et al., 2019). In corporate contexts, suitable targets include financial reporting systems or large transactions, while guardianship may take the form of internal controls, auditing, or regulatory oversight.

RAT conceptualizes fraud as an outcome of situational opportunities rather than individual deviance (Cohen & Felson, 2015; Schaefer, 2021). Industry sectors with weak regulation, opaque transactions, or high managerial discretion create conditions of low guardianship and elevated fraud risk. Although the framework prioritizes external structures and may overlook cultural or relational dynamics within firms (Kleemans et al., 2012; Schaefer, 2021; Steinmetz, 2025), it remains valuable for identifying environmental asymmetries that shape sector-specific vulnerabilities. Understanding how opportunity structures differ across industries enhances the interpretability of NLP-based classification models. High-risk sectors such as finance or mining may face increased exposure to tax fraud, while bribery often concentrates in construction or energy industries due to procurement vulnerabilities. These theoretical elements are operationalized in the empirical analysis through the identification of linguistic indicators of actors, transactions, and control conditions within AAER disclosures, allowing fraud patterns to be interpreted as variations in opportunity structures across industries.

Methodological Framework

The methodological design prioritizes transparency and interpretability to ensure that the analytical process remains explainable and aligned with criminological inquiry, rather than operating as a “black box” predictive system. The approach supports theory-driven analysis, with techniques selected to uncover structured, industry-specific patterns of fraud consistent with RAT. Emphasis is placed on linking model outputs to meaningful sectoral dynamics, reinforcing the integration of computational methods with criminological explanations.

Dataset: SEC AAER-Based Corporate Fraud Disclosures

The dataset is drawn from the SEC’s AAERs, a widely used source in financial fraud research. It includes 4,278 disclosures issued between May 17, 1982, and December 31, 2023, capturing 1,816 firm-level fraud events across approximately 1,364 unique firms. Each observation consists of a narrative disclosure detailing the misconduct, implicated actors, affected accounts, and procedural elements of fraud. These disclosures provide a rich, unstructured textual foundation for analysis, allowing fraud to be examined as a context-dependent and socially embedded phenomenon rather than a purely technical anomaly. RAT informs the use of these narratives by framing fraud in terms of motivated offenders, suitable targets, and the absence of capable guardianship. Narrative detail within AAERs enables these elements to be identified and systematically analyzed across industries. Consequently, the dataset supports a theory-driven approach in which linguistic patterns reflect underlying opportunity structures rather than serving as inputs to a purely predictive “black box” model.

Data Source

The structured AAER dataset was obtained from the University of Southern California, where it has been compiled for academic use. Each entry includes structured identifiers (e.g., firm name, AAER number, year) alongside detailed narrative disclosures that explain the nature and context of enforcement actions. Narrative content is particularly suited to operationalizing RAT within the analytical framework. Disclosures consistently contain:

• Nature of Misconduct: Descriptions of financial wrongdoing (e.g., revenue inflation, bribery, tax fraud, and investment manipulation) reflecting the type of target and method of exploitation.

• Entities Involved: Identification of organizational and individual actors (e.g., executives, auditors, subsidiaries), enabling the mapping of offender roles.

• Contextual Information: References to internal control weaknesses, financial pressures, and industry-specific conditions, which signal variations in guardianship and opportunity structures.

Together, these elements provide interpretable, theory-relevant indicators that allow fraud typologies to be linked to industry contexts. The analytical focus, therefore, extends beyond classification to identifying structured patterns of opportunity and control consistent with criminological theory.

NLP Workflow

Text Preprocessing and Feature Engineering

AAER narrative disclosures were processed using a streamlined NLP pipeline to standardize text while preserving semantically meaningful cues relevant to fraud classification. Preprocessing focused on reducing noise and ensuring consistency across disclosures so that linguistic patterns reflecting fraud mechanisms and industry context could be identified in a transparent and interpretable manner (Chang et al., 2022; Khurana et al., 2023). The pipeline included lowercasing, removal of punctuation and non-alphanumeric characters, tokenization, and stopword removal to eliminate syntactic noise. Stemming (Porter Stemmer) and lemmatization (WordNet Lemmatizer) were applied to unify word forms and improve the interpretability of key terms. These steps ensure that language linked to actors, transactions, and control environments—central to RAT—is retained, allowing subsequent analysis to capture structured indicators of opportunity and guardianship across industries. The resulting normalized text provides a consistent and interpretable input for TF-IDF vectorization, topic modeling, and classification. Below is a breakdown of the preprocessing steps:

• Lowercasing: All text was converted to lowercase using Python’s .lower() method to ensure uniformity and to avoid treating semantically identical tokens as distinct (e.g., “Bribe” and “bribe”).

• Removing Punctuation and Special Characters: Punctuation marks and non-alphanumeric symbols were removed using regular expressions to eliminate noise and retain only meaningful textual content.

• Tokenization: The cleaned text was tokenized into individual words using word_tokenize() from NLTK, creating the basic units for linguistic and statistical transformation.

• Stopword Removal: Common English stopwords were removed using NLTK’s built-in stopword list to reduce syntactic clutter and retain only informative words relevant to fraud classification.

• Stemming: The Porter Stemmer from NLTK was applied to reduce words to their root form (e.g., “misrepresenting” → “misrepres”), helping to unify word variants and reduce vocabulary size.

• Lemmatization: The WordNet Lemmatizer was also applied to refine word forms to their base lemma (e.g., “violations” → “violation”), ensuring grammatical consistency and improving interpretability.
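
The first four steps above can be illustrated with a minimal, standard-library-only sketch. It is not the study's pipeline: the regular expression, the whitespace tokenizer, the small STOPWORDS set, and the sample sentence are all assumptions standing in for NLTK's word_tokenize() and stopword list. Stemming and lemmatization are omitted because they depend on NLTK resources.

```python
import re

# Tiny illustrative stopword set (a hypothetical stand-in for NLTK's English list).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "for", "by", "was", "were"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, strip non-alphanumerics, split on whitespace, drop stopwords."""
    lowered = text.lower()
    # Replace every character that is not a letter, digit, or whitespace with a space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    # Whitespace split is a simplification of a real tokenizer.
    tokens = cleaned.split()
    return [tok for tok in tokens if tok not in stopwords]


# Invented sentence written in the style of an enforcement narrative.
example = "The CFO paid a Bribe to auditors, and overstated revenue by $2 million."
tokens = preprocess(example)
print(tokens)
```

Because lowercasing happens first, “Bribe” and “bribe” map to the same token, which is the behavior the lowercasing step is meant to guarantee.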

Text to Numerical Conversion

Cleaned disclosures were transformed into numerical features using Term Frequency–Inverse Document Frequency (TF-IDF). TF-IDF was selected because it emphasizes terms that are both frequent within a document and distinctive across the corpus, allowing fraud-related language to be identified in a transparent and interpretable manner. Such weighting supports the identification of terms associated with specific fraud typologies and industry contexts, aligning feature construction with the study’s focus on explainable, theory-informed analysis rather than opaque representations. The TF-IDF weight for a term t in document d from corpus D is given by:

$$\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)$$

Where:

Term Frequency (TF):

$$\mathrm{TF}(t, d) = \frac{f_{t, d}}{\sum_{k} f_{k, d}}$$

• $f_{t, d}$ : Frequency of term t in document d

• $\sum_{k} f_{k, d}$ : Total number of terms in document d

The Term Frequency measures how often term t appears in document d relative to its length.

Inverse Document Frequency (IDF):

$$\mathrm{IDF}(t, D) = \log\left(\frac{N}{1 + |\{d \in D : t \in d\}|}\right)$$

• N: Total number of documents in the corpus

• $|\{d \in D : t \in d\}|$ : Number of documents in which term t appears

• The “+1” in the denominator prevents division by zero

The IDF measures how unique or rare the term t is across the entire corpus.
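
As a worked illustration, the TF and IDF definitions above can be transcribed directly into a few lines of Python. This is a sketch under stated assumptions: the four-document corpus is invented, the function names are hypothetical, and the natural logarithm is assumed because the text does not specify a base.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Term frequency: count of the term divided by document length."""
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """Inverse document frequency with the +1 in the denominator."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + doc_freq))


def tf_idf(term, doc_tokens, corpus):
    """Product of TF and IDF for one term in one document."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Invented toy corpus of already-tokenized disclosures.
corpus = [
    ["bribe", "official", "payment"],
    ["revenue", "overstated", "revenue", "inventory"],
    ["revenue", "tax", "undisclosed"],
    ["loan", "related", "party"],
]

print(tf_idf("bribe", corpus[0], corpus))    # rare term, higher weight
print(tf_idf("revenue", corpus[1], corpus))  # frequent in doc, but less distinctive
```

One consequence of this particular IDF form is that a term appearing in every document receives a slightly negative weight, since N / (1 + N) is below 1. Library implementations (for example scikit-learn's TfidfVectorizer) use different smoothing and normalization defaults, so their values would not match this sketch exactly.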

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) was applied to identify latent thematic structures within AAER disclosures, allowing fraud typologies to emerge from recurring linguistic patterns. Using TF-IDF and CountVectorizer representations, LDA grouped co-occurring terms into interpretable clusters reflecting procedural elements of fraud, including patterns such as “round-tripping,” “sham transactions,” and “undisclosed related parties.” These patterns provide structured indicators of fraud mechanisms that extend beyond predefined categories. LDA models each document as a mixture of topics and each topic as a distribution over words. The probability of observing a word w in a document d is given by:

$$P(w \mid d) = \sum_{k = 1}^{K} P(w \mid z = k) \cdot P(z = k \mid d)$$

where K is the number of topics, $P(w \mid z = k)$ represents the probability of word w given topic k, and $P(z = k \mid d)$ is the probability of topic k in document d. Top terms and representative documents were examined to ensure semantic coherence and interpretability. Identified topics were then mapped to five fraud typologies (Financial Misstatements, Tax Fraud, Bribery, Investment Fraud, and Loan/Related Party Transactions), linking linguistic patterns to structured forms of misconduct. These topic structures capture recurring configurations of actors, transactions, and control conditions, enabling interpretation of fraud patterns in line with RAT.
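
The mixture formula for the probability of a word given a document can be made concrete with a small numerical sketch. All numbers below are invented for illustration and are not estimates from the AAER corpus; the variable and function names are hypothetical.

```python
# Three-word toy vocabulary (invented).
vocab = ["bribe", "revenue", "tax"]

# P(w | z = k): one word distribution per topic; each row sums to 1.
topic_word = [
    [0.7, 0.2, 0.1],  # topic 0, dominated by "bribe"
    [0.1, 0.6, 0.3],  # topic 1, dominated by "revenue"
]

# P(z = k | d): topic mixture for a single document; sums to 1.
doc_topic = [0.25, 0.75]


def word_prob(word_index, topic_word, doc_topic):
    """Sum over topics of P(w | z = k) * P(z = k | d)."""
    return sum(
        topic_word[k][word_index] * doc_topic[k] for k in range(len(doc_topic))
    )


probs = {w: word_prob(i, topic_word, doc_topic) for i, w in enumerate(vocab)}
print(probs)
```

Because both inputs are proper probability distributions, the resulting word probabilities for the document also sum to 1, which is a quick sanity check on any fitted topic model.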

Independent and Dependent Variables

Following preprocessing and LDA-based topic modelling, the dataset was structured for supervised classification. The dependent variable—fraud type—was derived through keyword analysis and manual annotation of AAER narratives. These labels reflect distinct forms of misconduct and serve as the target for prediction. The independent variables consist of tokenized word distributions from the “explanation” sections, transformed into numerical features using TF-IDF weighting. Such features retain contextually relevant language tied to fraud mechanisms, actors, and control conditions, supporting an interpretable representation of industry-specific misconduct. Variable construction, therefore, aligns with RAT by capturing linguistic indicators of opportunity structures and variations in guardianship across sectors, rather than relying on opaque feature representations. Table 1 presents key terms associated with each fraud type, identified through LDA and domain expertise, forming the input features for model training.

Table 1.

Fraud Types

Crime category	Associated keywords
Tax fraud	Tax, income, undisclosed, improperly, fraudulent, accounting, taxes
Financial misstatements	Inventory, sales, accounts, financial, auditor, recorded, firm, audit, false, statements, capitalized, independence, costs, reserves
Loan and related-party transactions	Loan, related, party, transactions, account, losses, disclose, failed
Investment fraud	Scheme^a, misappropriation, fictitious, stock, company, funds, backdated, option
Bribery	Bribe, bribery, kickback, corruption, improper payment

^a“Scheme” includes fraudulent stock operations.

Modeling Performance

To classify fraud types from narrative disclosures, a set of supervised and deep learning models was implemented, including Logistic Regression, Random Forest, Naïve Bayes, Support Vector Machine (SVM), RNN + LSTM, and BiLSTM. Model selection balances predictive performance with interpretability, allowing comparison between transparent approaches and more complex architectures. All models were trained on TF-IDF features derived from AAER disclosures, with fraud type as the target variable. Model performance (Table 2) was evaluated using accuracy, precision, recall, F1-score, and AUROC to assess classification performance across fraud categories, particularly under class imbalance. Emphasis is placed on F1-score and AUROC as they provide a more reliable assessment of model effectiveness in distinguishing fraud types across industry contexts.

Table 2.

Performance Measures

Metric	Formula
Accuracy	(TP + TN)/(TP + TN + FP + FN)
Precision	TP/(TP + FP)
Recall (sensitivity)	TP/(TP + FN)
F1-score	2 * (precision * recall)/(precision + recall)
AUROC	AUROC = $\int_{0}^{1} \mathrm{TPR} \, d(\mathrm{FPR})$
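
In a multi-class setting, the precision, recall, and F1 formulas in Table 2 are typically applied per class in a one-vs-rest fashion. The sketch below shows that calculation; the label lists are invented, the function names are hypothetical, and one-vs-rest counting is an assumption about how per-class scores were obtained.

```python
def per_class_counts(y_true, y_pred, label):
    """Count true positives, false positives, and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Apply the Table 2 formulas, returning 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented labels for six disclosures.
y_true = ["bribery", "bribery", "tax", "tax", "tax", "misstatement"]
y_pred = ["bribery", "tax", "tax", "tax", "bribery", "misstatement"]

for label in ("bribery", "tax", "misstatement"):
    print(label, precision_recall_f1(*per_class_counts(y_true, y_pred, label)))
```

Reporting the scores per class, rather than a single accuracy figure, keeps weak performance on rare fraud types visible under class imbalance.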

Experiment: SMOTE vs. No-SMOTE Evaluation With Machine Learning Classifiers

To thoroughly assess model performance and better understand fraud patterns across industries, two experimental setups were used: one with Synthetic Minority Oversampling Technique (SMOTE) resampling and one without. The dataset exhibited significant class imbalance, with fraud types like Financial Misstatements (n = 891) and Tax Fraud (n = 413) greatly outnumbering less frequent categories such as Loan and Related-Party Transactions (n = 218), Investment Fraud (n = 119), and Bribery (n = 64). To account for this imbalance, models were trained under two conditions: (1) using the original imbalanced dataset and (2) applying SMOTE, which generates new instances for minority classes by interpolating between nearest neighbors (Chawla et al., 2002; Elreedy & Atiya, 2019), resulting in a balanced dataset of 624 samples per class. Comparing SMOTE and non-SMOTE conditions clarified how class imbalance, synthetic augmentation, and industry-specific data distributions affect model sensitivity, classification accuracy, and detection of rare fraud types.
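
The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration, not the implementation used in the study: the two-dimensional points, the function names, and the choice of k = 2 are assumptions, and in practice a library such as imbalanced-learn would be applied to the TF-IDF feature vectors.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def smote_sample(minority, k=2, rng=random):
    """Create one synthetic point between a minority sample and a near neighbor.

    Requires at least two minority samples.
    """
    base = rng.choice(minority)
    # The k nearest other minority samples to the chosen base point.
    others = [p for p in minority if p is not base]
    neighbors = sorted(others, key=lambda p: euclidean(base, p))[:k]
    neighbor = rng.choice(neighbors)
    # Random position along the segment from base to neighbor.
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]


# Invented minority-class feature vectors (corners of the unit square).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Each synthetic point lies on a line segment between two real minority samples, so oversampling adds no information beyond the region the minority class already occupies. This is one reason recall gains can come with lower precision.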

Algorithms Employed

Table 3 summarizes the classification algorithms used to predict fraud types from the SEC’s AAER narratives, including probabilistic, linear, ensemble, margin-based, and deep learning models. The model set is intentionally diverse to compare how different learning approaches capture patterns in fraud-related language. Emphasis is placed on interpretable models, particularly Logistic Regression and Random Forest, as they allow a clearer linkage between linguistic features and fraud typologies. More complex models, including RNN + LSTM and biLSTM, are included to assess whether sequential and contextual language patterns improve classification performance. The use of multiple algorithms is not intended to advance model complexity but to evaluate how consistently fraud patterns—reflected in narrative indicators of actors, transactions, and control conditions—can be identified across methods. Such patterns are interpreted in relation to industry-specific variations in opportunity and guardianship, consistent with RAT’s framework.

Table 3.

Algorithms Employed

Category	Algorithm	Description	Representative formula
Probabilistic models	Naïve Bayes	Uses Bayes’ theorem with feature independence assumption to estimate class probabilities.	$P (C_{k} ∣ x) = \frac{P (C_{k}) \prod_{i = 1}^{n} P (x_{i} ∣ C_{k})}{P (x)}$
Linear models	Logistic regression	Models the log-odds of the target as a linear combination of input features.	$\hat{p} = \frac{1}{1 + e^{- (β_{0} + β_{1} x_{1} + \dots + β_{n} x_{n})}}$
Ensemble learning	Random forest	Combines multiple decision trees using bootstrap aggregation and majority voting.	$\hat{y} = mode (h_{1} (x), h_{2} (x), \dots, h_{T} (x))$
Margin-based learning	Support vector machine (SVM)	Constructs a hyperplane to maximize the margin between classes.	$f (x) = sign (w \cdot x + b)$
Deep learning	RNN with LSTM units	Learns sequential dependencies using memory cells and gating mechanisms.	$h_{t} = LSTM (x_{t}, h_{t - 1})$
Deep learning	biLSTM	Captures long-term dependencies in both forward and backward directions using two LSTM layers.	$h_{t} = \mathrm{biLSTM}(x_{t}, \overrightarrow{h}_{t - 1}, \overleftarrow{h}_{t + 1})$

Findings and Analysis

Exploratory Data Analysis

Figure 1 presents a word cloud illustrating the most frequent terms in the fraud dataset. Prominent words such as asset, revenue, expense, sale, and overstated reflect recurring patterns of financial misrepresentation, including inflated revenues, misstated expenses, and improper asset reporting across industries. The frequent appearance of failed suggests widespread regulatory noncompliance, while terms such as income, inventory, cost, and recorded point to manipulation of core accounting metrics. Procedural terms including bribe, scheme, and transaction highlight mechanisms through which fraud is enacted. Together, these patterns provide an initial indication of how fraud manifests in narrative disclosures and suggest industry-specific operational and regulatory vulnerabilities.

Figure 1.

Word cloud of frequent words

Figure 2 shows the distribution of fraud types across industry sectors, highlighting sector-specific vulnerabilities. Financial misstatements dominate in Mining (82.14%), Business Services (71.43%), and Software (70.59%), indicating that complex reporting environments heighten manipulation risk. Investment fraud is most prevalent in the investment sector (36.36%) and appears frequently in Pharmaceuticals and Information Technology, reflecting valuation uncertainty. Bribery is concentrated in Pharmaceuticals (28.95%), consistent with regulatory and procurement exposure. Tax fraud is more dispersed but prominent in Construction (45.45%), Retail (30.26%), and Real Estate (30.95%), sectors characterized by flexible revenue recognition and deductions.

Figure 2.

Fraud type frequency by industry

Chi-Square Test of Independence

A Chi-Square Test of Independence assessed whether fraud patterns vary across industries, yielding a chi-square statistic of 236.79 (df = 124, p = 4.54e-9). The result rejects the null hypothesis, confirming a statistically significant association between fraud type and industry. Several sectors are disproportionately linked to specific forms of misconduct, underscoring the need for industry-sensitive detection strategies. Table 4 summarizes industry-level deviations in fraud patterns based on standardized residuals. Pharmaceuticals emerge as a strong outlier for bribery (residual = 7.8), indicating substantially higher occurrence than expected. Loan and related-party transaction fraud is overrepresented in Banking (4.45), Financial Services (2.64), and Investment Management (2.32), while investment fraud is disproportionately concentrated in the Investment sector (3.66). Mining also shows elevated financial misstatements (2.1). In contrast, Technology exhibits significantly fewer loan and related-party fraud cases than expected (−2.97). These deviations highlight the industry-specific nature of financial misconduct and provide actionable insight for risk assessment, internal control design, and regulatory enforcement.

Table 4.

Residual Analysis of Industry-Level Fraud Distribution

Industry	Crime category	Residual	Interpretation
Pharmaceuticals	Bribery	7.8	Much more bribery than expected
Banking	Loan and related-party transactions	4.45	More loan fraud than expected
Investment	Investment fraud	3.66	Much more investment fraud than expected
Financial services	Loan and related-party transactions	2.64	More loan fraud than expected
Investment management	Loan and related-party transactions	2.32	More loan fraud than expected
Mining	Financial misstatements	2.1	More financial misstatements than expected
Technology	Loan and related-party transactions	−2.97	Much less loan fraud than expected
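
The chi-square statistic and the residuals reported in Table 4 can be reproduced in principle from an industry-by-fraud-type contingency table. The sketch below uses an invented 2 × 2 table and hypothetical function names. It computes Pearson residuals, (observed − expected) / √expected, which is an assumption, since the text does not state which standardization was used.

```python
def chi_square(table):
    """Return the chi-square statistic, degrees of freedom, and Pearson residuals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected counts under independence of rows and columns.
    expected = [[rt * ct / total for ct in col_totals] for rt in row_totals]
    residuals = [
        [(o - e) / e ** 0.5 for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]
    # The statistic is the sum of squared Pearson residuals.
    stat = sum(r ** 2 for row in residuals for r in row)
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof, residuals


# Invented counts: rows are two industries, columns are two fraud types.
observed = [
    [20, 10],
    [10, 20],
]
stat, dof, residuals = chi_square(observed)
print(stat, dof)
print(residuals)
```

Positive residuals mark cells with more cases than independence would predict, and negative residuals mark fewer. The reported df = 124 would be consistent with a table of 32 industries by 5 fraud types, since (32 − 1) × (5 − 1) = 124, although the number of industry categories is inferred here rather than stated in the text.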

HDBSCAN Clustering Techniques

Figure 3 shows HDBSCAN clustering applied to UMAP-reduced industry fraud risk profiles. HDBSCAN was selected for its ability to identify density-based clusters without predefining cluster numbers. The results reveal distinct industry groupings with similar fraud risk characteristics, including a cluster linking Telecommunications, Software, and Food & Beverage, and another aggregating Banking, Insurance, Real Estate, and Construction—sectors associated with complex, institutional forms of misconduct. A silhouette score of 0.48 indicates moderate cluster separation, suggesting meaningful but overlapping fraud typologies across industries. These clusters support the development of industry-specific fraud detection and mitigation strategies.

Figure 3.

HDBSCAN clustering of industry fraud risk groups

Machine Learning Model Results

Table 5 reports classification performance under SMOTE and No-SMOTE conditions. Logistic Regression produced the most balanced and consistent results, achieving strong precision, recall, and F1 scores across all fraud types, including minority classes such as Investment Fraud and Related-Party Transactions. The F1 score proved particularly informative in this imbalanced multi-class setting. SMOTE generally improved recall for underrepresented fraud types—most notably for Random Forest and Naïve Bayes—but often reduced precision, resulting in uneven performance gains. For instance, Random Forest identified more Investment Fraud cases under SMOTE but with lower precision. SVM performed well on dominant categories such as Financial Misstatements and Tax Fraud but struggled with rare classes, particularly under oversampling. Logistic Regression remained stable and interpretable across sampling strategies, highlighting its suitability for industry-specific fraud classification. Overall, the results show that SMOTE effectiveness is model-dependent, reinforcing the need to align resampling techniques with algorithmic behavior to avoid performance trade-offs (Chawla et al., 2002; Elreedy & Atiya, 2019).

Table 5.

Results of Machine Learning Models

Fraud category	SMOTE Precision	SMOTE Recall	SMOTE F1-score	No-SMOTE Precision	No-SMOTE Recall	No-SMOTE F1-score
Naïve Bayes
Bribery	0.62	0.82	0.71	0.68	0.77	0.72
Financial misstatements	0.84	0.67	0.74	0.83	0.78	0.80
Investment fraud	0.48	0.44	0.46	0.71	0.47	0.57
Related party transactions	0.50	0.55	0.52	0.58	0.50	0.54
Tax fraud	0.64	0.86	0.73	0.72	0.89	0.80
Logistic regression
Bribery	0.86	0.86	0.86	0.95	0.86	0.90
Financial misstatements	0.94	0.96	0.95	0.92	0.97	0.94
Investment fraud	0.78	0.78	0.78	0.79	0.75	0.77
Related party transactions	0.85	0.81	0.83	0.84	0.72	0.78
Tax fraud	1.00	0.98	0.99	1.00	0.98	0.99
SVM
Bribery	0.87	0.91	0.89	0.83	0.86	0.84
Financial misstatements	0.89	0.93	0.91	0.87	0.94	0.91
Investment fraud	0.65	0.56	0.60	0.75	0.58	0.66
Related party transactions	0.69	0.72	0.71	0.74	0.64	0.69
Tax fraud	0.99	0.90	0.94	0.99	0.94	0.96
Random forest
Bribery	0.77	0.91	0.83	0.95	0.82	0.88
Financial misstatements	0.91	0.92	0.91	0.84	0.95	0.89
Investment fraud	0.70	0.78	0.74	0.84	0.44	0.58
Related party transactions	0.70	0.67	0.68	0.72	0.62	0.67
Tax fraud	0.97	0.90	0.93	1.00	0.95	0.98
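
The relationship between the precision, recall, and F1 columns in Table 5 can be checked with a short sketch. The precision and recall values are copied from Table 5; the helper name and the tolerance of 0.01 (to allow for two-decimal rounding of the inputs) are assumptions made for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (model, fraud category, sampling) -> (precision, recall, reported F1), from Table 5.
table5_rows = {
    ("Random forest", "Investment fraud", "SMOTE"): (0.70, 0.78, 0.74),
    ("Random forest", "Investment fraud", "No-SMOTE"): (0.84, 0.44, 0.58),
    ("Logistic regression", "Related party transactions", "SMOTE"): (0.85, 0.81, 0.83),
    ("Logistic regression", "Related party transactions", "No-SMOTE"): (0.84, 0.72, 0.78),
}

recomputed = {}
for key, (precision, recall, reported) in table5_rows.items():
    recomputed[key] = f1_score(precision, recall)
    model, category, sampling = key
    print(f"{model}, {category}, {sampling}: "
          f"recomputed F1 = {recomputed[key]:.3f}, reported = {reported:.2f}")
```

The Random Forest rows show the trade-off described above: under SMOTE, recall on Investment Fraud rises from 0.44 to 0.78 while precision falls from 0.84 to 0.70, and the F1 score improves because the harmonic mean penalizes the large gap between precision and recall in the No-SMOTE condition.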

Table 6 presents the overall accuracy of the four classification models under SMOTE and No-SMOTE conditions, emphasizing their effectiveness in industry-specific fraud detection. Logistic Regression achieved the highest accuracy in both settings, with a slight improvement under SMOTE (0.93 vs. 0.92), confirming its robustness in handling class imbalance without compromising predictive reliability across diverse industry sectors. Random Forest and SVM demonstrated stable accuracy (SVM: 0.88 to 0.87; Random Forest: 0.87 to 0.88), indicating their general resilience to synthetic oversampling and suitability for capturing fraud patterns across varied operational contexts. In contrast, Naïve Bayes showed a notable decline in accuracy under SMOTE (0.76 to 0.71), suggesting its limitations in generalizing from synthetically augmented fraud cases—particularly in industries with rare or complex fraud types. These results reinforce earlier F1-score findings, highlighting Logistic Regression as the most stable model for complex, industry-aligned fraud classification and affirming the need for model evaluation frameworks that consider both global accuracy and class-specific performance.

Table 6.

Results of Performance Accuracy

Algorithm	Accuracy (SMOTE)	Accuracy (No-SMOTE)
Naïve Bayes	0.71	0.76
Logistic regression	0.93	0.92
SVM	0.87	0.88
Random forest	0.88	0.87

In the context of imbalanced datasets, overall accuracy is insufficient, as it may mask underperformance on minority fraud categories often tied to specific industries. ROC analysis offers a more robust evaluation by measuring the trade-off between true positive and false positive rates across thresholds. As illustrated in Figure 4, micro-averaged ROC curves confirm that Logistic Regression, SVM, and Random Forest are among the top-performing models in detecting fraud across industry sectors. Random Forest yielded the highest micro-average AUC (0.99), indicating superior discriminatory power, followed closely by Logistic Regression and SVM (0.98). Naïve Bayes performed notably worse, with an AUC of 0.92, aligning with its weaker classification metrics. These findings underscore that while Logistic Regression maintains balanced performance across fraud types and industries, Random Forest offers marginally better overall ranking capability in identifying industry-specific fraud risks.

Figure 4.

Machine learning model ROC curves results
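
The difference between micro- and macro-averaged AUC in a multi-class setting can be sketched as follows. The labels and predicted probabilities are hypothetical, the function names are illustrative, and the AUC is computed with the pairwise ranking definition (the probability that a positive case is scored above a negative one) rather than with any particular library.

```python
def binary_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count one half)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def micro_macro_auc(labels, probabilities, n_classes):
    """One-vs-rest AUC: micro pools all class decisions, macro averages per-class AUCs."""
    per_class = []
    pooled_true, pooled_scores = [], []
    for k in range(n_classes):
        y_k = [1 if label == k else 0 for label in labels]
        s_k = [row[k] for row in probabilities]
        per_class.append(binary_auc(y_k, s_k))
        pooled_true.extend(y_k)
        pooled_scores.extend(s_k)
    micro = binary_auc(pooled_true, pooled_scores)
    macro = sum(per_class) / n_classes
    return micro, macro, per_class


# Hypothetical predictions for six disclosures and three fraud categories.
labels = [0, 0, 1, 1, 2, 2]
probabilities = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.5, 0.3, 0.2],  # a class-1 case that the model scores as class 0
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
]

micro, macro, per_class = micro_macro_auc(labels, probabilities, n_classes=3)
print(f"per-class AUC = {[round(a, 3) for a in per_class]}")
print(f"micro AUC = {micro:.3f}, macro AUC = {macro:.3f}")
```

Micro-averaging pools every one-vs-rest decision, so frequent categories dominate the result, whereas macro-averaging weights each fraud category equally. This distinction matters when the machine learning and deep learning AUC values are compared, because they may not have been averaged in the same way.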

Deep Learning Models

Table 7 reports the performance of the deep learning models, RNN + LSTM and biLSTM, in classifying fraud types across industry sectors. Both models achieved high overall accuracy (0.85 and 0.84, respectively), close to Random Forest and SVM (0.87 to 0.88) but below Logistic Regression (0.92 to 0.93). The biLSTM model showed marginal improvements in identifying minority fraud types, such as Investment Fraud (F1 = 0.69 vs. 0.62) and Related Party Transactions (F1 = 0.66 vs. 0.65), suggesting potential utility in industry-specific fraud contexts. Despite these gains, neither model matched Logistic Regression in delivering consistent, balanced performance across all fraud categories. Deep learning models demonstrated strong results for prevalent fraud types, particularly Financial Misstatements and Tax Fraud, highlighting their strength in capturing dominant patterns within regulatory disclosures. However, the limited advantage of biLSTM over RNN + LSTM and over simpler algorithms indicates that while sequence-aware models add value in processing complex narratives, traditional approaches like Logistic Regression remain robust and interpretable tools for industry-informed fraud classification.

Table 7.

Deep Learning Results

Fraud category	Precision	Recall	F1-score	Accuracy
RNN + LSTM
Bribery	0.79	0.86	0.83
Financial misstatements	0.86	0.93	0.89
Investment fraud	0.76	0.53	0.62
Related party transactions	0.69	0.62	0.65
Tax fraud	0.92	0.86	0.89
Overall				0.85
biLSTM
Bribery	0.81	0.89	0.85
Financial misstatements	0.88	0.87	0.88
Investment fraud	0.79	0.61	0.69
Related party transactions	0.67	0.65	0.66
Tax fraud	0.85	0.93	0.88
Overall				0.84

Figure 5 presents the macro-averaged ROC curves for the RNN + LSTM and biLSTM models, comparing their discriminative performance across all fraud categories. Both models achieved high AUC values—0.96 for RNN + LSTM and 0.94 for biLSTM—indicating a strong capability in distinguishing between multiple fraud types across sectors. RNN + LSTM showed a slightly better trade-off, particularly at lower false positive thresholds, consistent with its higher overall accuracy and balance in precision and recall. While biLSTM’s bidirectional architecture offers theoretical advantages in capturing context, RNN + LSTM demonstrated superior macro-level discrimination. These findings suggest that, for industry-specific fraud classification, RNN + LSTM provides a more effective deep learning solution for handling complex narrative disclosures.

Figure 5.

Deep learning model ROC curves results

Discussion

Traditional fraud detection approaches often overlook the diverse nature of financial misconduct, failing to distinguish which industries are most associated with specific types of fraud (Bao et al., 2020; Bertomeu et al., 2021; Perols, 2011). Such generalizations limit regulatory and audit effectiveness, producing policies that may target the wrong behaviors. Addressing this limitation, the study applies NLP techniques—specifically supervised classification and topic modelling—to classify distinct fraud typologies and uncover their industry-specific patterns. Rather than treating fraud as uniform, the analysis highlights its heterogeneity across sectors, namely bribery, tax fraud, investment fraud, related-party transactions, and financial misstatements. The strong performance of logistic regression, supported by high F1 scores and ROC-AUC values, reinforces the interpretability of linear models (Cecchini et al., 2010; Perols, 2011), while deep learning adds nuance to classification. Together, these methods yield a more granular understanding of fraud and its sectoral contours, laying the groundwork for targeted and effective detection and enforcement strategies.

The application of SMOTE revealed that oversampling can improve recall for underrepresented fraud types, such as investment fraud and related-party transactions, but its benefits depend on model architecture and industry risk patterns. Logistic regression and random forest benefited most, achieving a stronger balance between precision and recall in sectors where these frauds are prevalent. In contrast, Naïve Bayes and SVM showed signs of overfitting or degraded precision, particularly in industries with complex or heterogeneous disclosures. The findings challenge assumptions of universal efficacy in imbalanced classification (Chawla et al., 2002; Elreedy & Atiya, 2019), emphasizing the need to align resampling strategies with both model biases and sectoral dynamics.
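
The model dependence of SMOTE is easier to see from its mechanism. The following is a minimal sketch of the core interpolation step described by Chawla et al. (2002), not the implementation used in the study. The feature vectors, the function name, and the parameter values are hypothetical; in practice the vectors would be document representations of the disclosures.

```python
import random
from math import dist


def smote_sketch(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority samples by interpolating towards a near neighbour.

    Assumes `minority` holds at least two distinct feature vectors (tuples).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature vectors for a rare fraud category.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
synthetic = smote_sketch(minority, n_synthetic=6)
for sample in synthetic:
    print(tuple(round(value, 3) for value in sample))
```

Because every synthetic point lies on a segment between two existing minority cases, oversampling densifies the region the minority class already occupies. Where that region overlaps with other fraud categories, the added points may pull decision boundaries into neighbouring classes, which is one plausible reason for recall gains accompanied by lower precision.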

Topic modelling results further support the view that fraud typologies are not randomly distributed but exhibit thematic and structural coherence within industry sectors. Successful mapping of latent patterns, identified through LDA and aligned with industry classifications, demonstrates that fraud emerges from patterned opportunity structures and procedural routines embedded within specific organizational contexts. Industries such as pharmaceuticals, banking, and investment management, which show elevated bribery or loan and related-party transaction fraud in the residual analysis (Table 4), reflect RAT’s structural predicates: motivated offenders, suitable targets, and weak guardianship (Cohen & Felson, 2015; Leukfeldt & Yar, 2016). These environments often feature weakened internal controls, normalized complexity, and diffuse oversight, reinforcing RAT’s ecological logic (Kleemans et al., 2012; Pratt et al., 2010; Yar, 2005).

CSA complements RAT by illuminating the stages and methods underlying distinct fraud types. Deep learning models (RNN + LSTM and biLSTM) captured sequential narrative elements such as preparation (fictitious contracts), execution (asset inflation), and concealment (complex accounting entries). These phases mirror CSA’s conceptualization of fraud as a routinized process. Analysis of disclosures revealed sectoral differences in procedural cues, particularly concealment methods. The higher recall of the biLSTM model on investment fraud (0.61 vs. 0.53) and tax fraud (0.93 vs. 0.86) suggests that such misconduct is articulated through temporally ordered language, although its F1 score on tax fraud (0.88) is marginally below that of RNN + LSTM (0.89).

Chi-square and clustering analyses revealed a differentiated fraud landscape shaped by sectoral dynamics: procurement practices in pharmaceuticals align with bribery, and opaque financial structures in banking with related-party transactions. These results provide empirical support for the claim that institutional environments shape both opportunity structures and procedural modalities of fraud (Gabbioneta et al., 2013; M. Lokanan, 2018; Morales et al., 2014). Clustering from HDBSCAN and UMAP showed that industries with similar fraud profiles coalesce into coherent communities—revealing fraud not as isolated organizational failure but as an outcome of shared structural logics across sectors.

The integrated theoretical framing grounded in RAT and CSA departs from actor-centered models by situating misconduct within broader structural and procedural contexts (Pratt et al., 2010; Schaefer, 2021). RAT outlines ecological conditions of opportunity, while CSA exposes the procedural choreography of fraud. Industries with high regulatory complexity and low guardianship—such as mining or financial services—illustrate RAT’s opportunity structures, and CSA explains how these are operationalized through repeatable behavioral patterns. Together, the theories present fraud as both structurally situated and procedurally enacted, supporting the need for context-sensitive models that move beyond binary classification toward richer understandings of organizational misconduct (Coleman, 1987; M. Lokanan, 2018; Shapiro, 1990). The theoretical grounding enhances the interpretability of model outputs, helping auditors, regulators, and compliance officers derive actionable insights that extend beyond anomaly detection toward sector-aware evaluations of financial misconduct (Perols, 2011).

Conclusion

At first glance, the present study may appear to depart from conventional fraud research, with its emphasis on algorithmic modeling and textual classification rather than on the proximate causes of financial wrongdoing. Yet, from a broader vantage point, what emerges is a significant contribution to the accounting literature: a reframing of fraud not simply as an individual or organizational transgression but as a patterned and discursively embedded phenomenon. By integrating NLP techniques with criminological theories, the study reveals a more granular and context-sensitive understanding of fraud typologies, one that bridges predictive analytics with the structural and procedural logics through which financial misconduct unfolds. Rather than limiting inquiry to the binary detection of fraud, the analysis traces the patterned emergence of fraud typologies through latent discursive structures and sectoral regularities. The integration of topic modeling, supervised classification, chi-square analysis, and density-based clustering reveals a differentiated architecture of fraud wherein acts such as bribery, tax fraud, and related-party transactions are not random deviations but manifestations of deeper institutional logics. These findings suggest that the legibility of fraud is governed not merely by statistical anomalies but by historically sedimented industry norms, oversight structures, and procedural routines that together delimit what is seen, how it is interpreted, and by whom it is rendered actionable.

A key insight to emerge from this analysis lies in the necessity of stepping beyond narrow, ontological accounts of fraud that reduce misconduct to individual pathology or isolated regulatory failure. Viewed through the structural prism of RAT, the research maps how weakened internal controls, complex transactional environments, and diffuse regulatory oversight create patterned opportunity structures that accommodate semantically porous and context-dependent fraud typologies. In parallel, CSA advances this line of inquiry by exposing the sequential logic through which fraudulent practices are enacted—revealing a procedural architecture encoded in the language of narrative disclosures. Together, these frameworks reveal fraud to be both contextually contingent and procedurally routinized, offering an integrated theoretical anchor for interpreting model outputs and guiding regulatory or investigative action.

Limitations and Future Research

The study focuses on fraud case narratives and industry classifications drawn from SEC Accounting and Auditing Enforcement Releases, that is, from a single U.S. regulatory context. As such, the structural patterns and linguistic cues used to classify fraud types may not generalize across jurisdictions with different regulatory environments, legal definitions of fraud, or disclosure norms. Aligned with the broader discursive turn in critical accounting fraud scholarship, future research should move beyond a narrowly realist stance by applying the NLP-based techniques across jurisdictions and temporal spans, thereby acknowledging the contextual and symbolic dimensions that shape fraud classification. Extending the analysis to multilingual corpora and varied regulatory environments would enable a deeper interrogation of how legal, cultural, and institutional discourses mediate the construction and detection of fraud, yielding a more nuanced and globally attuned understanding of financial misconduct, including its veiled and symbolically coded manifestations.

Although deep learning models such as biLSTM demonstrate strong performance in identifying fraud typologies, the opacity of their internal decision-making processes limits their practical utility in high-stakes regulatory or auditing environments, where explainability is critical. Integrating model-agnostic interpretability tools such as SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) could help unpack the model’s decisions. Additionally, combining interpretable models like logistic regression with deep learning outputs in a hybrid framework may provide a balance between predictive power and explanatory value.

Footnotes

ORCID iD

Mark Lokanan

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Bao, Ke, Li, Yu, Y. J., & Zhang (2020). Detecting accounting fraud in publicly traded U.S. firms using a machine learning approach. Journal of Accounting Research, 58(1), 199–235. https://doi.org/10.1111/1475-679X.12292

2. Bertomeu, Cheynel, Floyd, & Pan (2021). Using machine learning to detect misstatements. Review of Accounting Studies, 26(2), 468–519. https://doi.org/10.1007/s11142-020-09563-8

3. Bochkay, Brown, S. V., Leone, A. J., & Tucker, J. W. (2023). Textual analysis in accounting: What’s next? Contemporary Accounting Research, 40(2), 765–805. https://doi.org/10.1111/1911-3846.12825

4. Cecchini, Aytug, Koehler, G. J., & Pathak (2010). Detecting management fraud in public companies. Management Science, 56(7), 1146–1160. https://doi.org/10.1287/mnsc.1100.1174

5. Chang, J.-W., Yen, & Hung, J. C. (2022). Design of a NLP-empowered finance fraud awareness model: The anti-fraud chatbot for fraud detection and fraud classification as an instance. Journal of Ambient Intelligence and Humanized Computing, 13(10), 4663–4679. https://doi.org/10.1007/s12652-021-03512-2

6. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16(1), 321–357. https://doi.org/10.1613/jair.953

7. Cheng, C.-H., & Cai, W.-H. (2023). Double-weight LDA extracting keywords for financial fraud detection system. Multimedia Tools and Applications, 83(17), 50757–50781. https://doi.org/10.1007/s11042-023-17334-1

8. Cohen & Felson (2015). Routine activity theory: A routine activity approach. In Criminology theory (pp. 313–321). Routledge. https://doi.org/10.4324/9781315721781

9. Coleman, J. W. (1987). Toward an integrated theory of white-collar crime. American Journal of Sociology, 93(2), 406–439. https://doi.org/10.1086/228750

10. Cooper, D. J., Dacin, & Palmer (2013). Fraud in accounting, organizations and society: Extending the boundaries of research. Accounting, Organizations and Society, 38(6–7), 440–457. https://doi.org/10.1016/j.aos.2013.11.001

11. Davis, J. S., & Pesch, H. L. (2013). Fraud dynamics and controls in organizations. Accounting, Organizations and Society, 38(6–7), 469–483. https://doi.org/10.1016/j.aos.2012.07.005

12. Elreedy & Atiya, A. F. (2019). A comprehensive analysis of synthetic minority oversampling technique (SMOTE) for handling class imbalance. Information Sciences, 505, 32–64. https://doi.org/10.1016/j.ins.2019.07.070

13. Feuerriegel & Pröllochs (2021). Investor reaction to financial disclosures across topics: An application of latent Dirichlet allocation. Decision Sciences, 52(3), 608–628. https://doi.org/10.1111/deci.12346

14. Gabbioneta, Greenwood, Mazzola, & Minoja (2013). The influence of the institutional context on corporate illegality. Accounting, Organizations and Society, 38(6–7), 484–504. https://doi.org/10.1016/j.aos.2012.09.002

15. Khurana, Koli, Khatter, & Singh (2023). Natural language processing: State of the art, current trends and challenges. Multimedia Tools and Applications, 82(3), 3713–3744. https://doi.org/10.1007/s11042-022-13428-4

16. Kleemans, E. R., Soudijn, M. R. J., & Weenink, A. W. (2012). Organized crime, situational crime prevention and routine activity theory. Trends in Organized Crime, 15(2–3), 87–92. https://doi.org/10.1007/s12117-012-9173-1

17. Leukfeldt, E. R., & Yar (2016). Applying routine activity theory to cybercrime: A theoretical and empirical analysis. Deviant Behavior, 37(3), 263–280. https://doi.org/10.1080/01639625.2015.1012409

18. Lokanan (2018). Theorizing financial crimes as moral actions. European Accounting Review, 27(5), 901–938. https://doi.org/10.1080/09638180.2017.1417144

19. Lokanan & Sharma (2024). The use of machine learning algorithms to predict financial statement fraud. The British Accounting Review, 56(6), 101441. https://doi.org/10.1016/j.bar.2024.101441

20. Lokanan, M. E. (2015). Challenges to the fraud triangle: Questions on its usefulness. Accounting Forum, 39(3), 201–224. https://doi.org/10.1016/j.accfor.2015.05.002

21. Matthews (2005). London and County Securities: A case study in audit and regulatory failure. Accounting, Auditing & Accountability Journal, 18(4), 518–536. https://doi.org/10.1108/09513570510609342

22. Morales, Gendron, & Guénin-Paracini (2014). The construction of the risky individual and vigilant organization: A genealogy of the fraud triangle. Accounting, Organizations and Society, 39(3), 170–194. https://doi.org/10.1016/j.aos.2014.01.006

23. Natarajan (Ed.). (2016). Crime opportunity theories: Routine activity, rational choice and their variants. Routledge. https://doi.org/10.4324/9781315095301

24. Paternoster & Simpson (1993). A rational choice theory of corporate crime. In Routine activity and rational choice (1st ed., pp. 37–58). Routledge.

25. Perols (2011). Financial statement fraud detection: An analysis of statistical and machine learning algorithms. Auditing: A Journal of Practice & Theory, 30(2), 19–50. https://doi.org/10.2308/ajpt-50009

26. Power (2013). The apparatus of fraud risk. Accounting, Organizations and Society, 38(6–7), 525–543. https://doi.org/10.1016/j.aos.2012.07.004

27. Pratt, T. C., Holtfreter, & Reisig, M. D. (2010). Routine online activity and internet fraud targeting: Extending the generality of routine activity theory. Journal of Research in Crime and Delinquency, 47(3), 267–296. https://doi.org/10.1177/0022427810365903

28. Schaefer (2021). Routine activity theory. In Schaefer (Ed.), Oxford Research Encyclopedia of Criminology and Criminal Justice. Oxford University Press. https://doi.org/10.1093/acrefore/9780190264079.013.326

29. Shapiro, S. P. (1990). Collaring the crime, not the criminal: Reconsidering the concept of white-collar crime. American Sociological Review, 55(3), 346. https://doi.org/10.2307/2095761

30. Steinmetz, K. F. (2025). Routine activities theory? A kindly critique and a pathway forward. Victims and Offenders, 20(5–6), 1–16. https://doi.org/10.1080/15564886.2025.2476681

31. Sutherland (1949). White collar crime. Dryden Press.

32. Wang & Xu (2018). Leveraging deep learning with LDA-based text analytics to detect automobile insurance fraud. Decision Support Systems, 105, 87–95. https://doi.org/10.1016/j.dss.2017.11.001

33. Williams, M. L., Levi, Burnap, & Gundur, R. V. (2019). Under the corporate radar: Examining insider business cybercrime victimization through an application of routine activities theory. Deviant Behavior, 40(9), 1119–1131. https://doi.org/10.1080/01639625.2018.1461786

34. (2025). Is machine learning really effective in detecting corporate fraud? Journal of Accounting Literature. https://doi.org/10.1108/JAL-11-2024-0347

35. Yar (2005). The novelty of ‘Cybercrime’: An assessment in light of routine activity theory. European Journal of Criminology, 2(4), 407–427. https://doi.org/10.1177/147737080556056