Exploring ensemble optimized voting and stacking classifiers through Cross-validation for early detection of suicidal ideation

Abstract

Detecting behavioral changes associated with suicidal ideation on social media is essential yet complex. While machine learning and deep learning hold promise in this regard, current studies often lack generalizability because they rely on a single dataset. Traditional embedding techniques struggle with semantic analysis, which makes high-accuracy models difficult to achieve, and conventional validation methods are limited in handling data drift. To address these challenges, this study proposes a novel evaluation approach using natural language processing across diverse platforms such as Twitter and Reddit. By integrating BERT embedding, which is adept at handling semantic nuances, with an optimized Stacked Classifier that combines different base classifiers with XGBoost as the meta-classifier, the model excels at swiftly detecting signs of suicidal ideation compared to the Voting Classifier (a combination of Decision Tree, Random Forest, Gradient Boost, and XGBoost) and several individual machine learning models. Additionally, the study explores advanced embedding techniques such as MUSE and LLM, and deep learning models including Bi-LSTM, Bi-GRU, and Text-CNN, for comparison. This ensemble approach aims to create a model that is not only interpretable but also robust, reducing computational complexity and enhancing resilience against noisy data, both common challenges in text classification tasks. Through K-fold validation, which partitions the dataset into k equal-sized subsets or "folds" and trains the model k times, using k-1 folds for training and one fold for testing each time, the proposed model achieves accuracy rates of 97% on the Reddit dataset and 96% on the Twitter dataset, underscoring its effectiveness in identifying suicidal ideation across social media platforms.

Keywords

Stacked Classifier; Voting Classifier; MUSE; BERT; Bi-GRU; Bi-LSTM; Suicidal ideation

1 Introduction

The prevalence of health issues such as anxiety and depression is increasing, particularly in developed countries [25]. If left untreated, these disorders can lead to behavioural changes and severe illness, potentially resulting in suicide attempts [22, 26]. Identifying the causes of suicide and the people at risk is a complex task. People experiencing feelings of depression are more prone to suicide attempts. According to the American Foundation for Suicide Prevention, the factors contributing to suicide can be categorized into health, environment, and personal history. Mental health concerns are often associated with a higher risk of suicide [32, 33]. In-depth studies on the psychology of suicide have revealed that personality, cognitive factors, social influences, and negative life experiences play significant roles [30, 31]. Understanding the daily routines of individuals is crucial for predicting behavioural changes and illness, since changes in their life patterns may act as triggers for suicidal thoughts [34].

Suicidal ideation, which arises from feelings of depression and despair, involves contemplating self-destruction. Recognized risk factors such as depression, behavioral shifts, and negative emotions are vital signs of potential suicide. Early identification of suicidal thoughts is a global priority, with WHO aiming for a 10% reduction in rates by 2030. Detection involves analyzing writing style or tabular data and using online platforms to monitor and prevent suicide. This pressing issue highlights the need to diagnose behavioral changes for prevention. Early intervention, alongside advancements in social media and computational linguistics, holds promise for reducing suicide rates [27]. The application of artificial intelligence and machine learning methods enables a better understanding of public intentions, facilitating early intervention [20, 21]. Analyzing social content, including feature engineering, sentiment analysis, and deep learning efforts, plays a vital role in detecting suicidal ideation and advancing current research trends [35, 36].

Prior studies often assessed models using a single dataset, potentially limiting their ability to generalize. Despite the prevalence of embedding techniques such as count vectorizer, TF-IDF, word2vec, and GloVe, these techniques face limitations: they disregard semantic meaning, word order, and syntax, and they struggle with fixed-size representations and out-of-vocabulary words. The conventional practice of splitting datasets into training, testing, and validation sets aims to evaluate model overfitting, but it is flawed due to the risk of overfitting or underfitting with limited data, variability from random set selection, biases from imbalanced datasets, and difficulty in addressing data drift over time. This research explores creating an interpretable and robust text classification model by leveraging the benefits of stacking, which mitigates overfitting, combines diverse model strengths, and adapts to various feature spaces. It aims to address challenges such as computational complexity and noise resilience, which are common in text classification tasks. By leveraging BERT's contextual bidirectional embedding and subword tokenization, the Stacked Classifier, which integrates Decision Tree, Random Forest, and Gradient Boosting as base classifiers with XGBoost as the meta-classifier, shows improved performance in detecting suicidal ideation, as assessed through k-fold cross-validation. Furthermore, it is compared against a Voting Classifier, Bi-LSTM, Bi-GRU, and Text-CNN, using various feature extraction techniques such as MUSE and LLM.

The significant contributions are summarized as follows:

BERT embedding, leveraging contextual bidirectional modeling and subword tokenization, excels in extracting features for robust semantic representation in sequential patterns.

The proposed Stacked Classifier surpasses the Voting Classifier and other models, providing interpretability, robustness, reduced computational complexity, and resilience to noisy data, while uncovering long-distance relationships within suicidal posts.

K-fold cross-validation provides a more robust evaluation of the model’s generalization across diverse social media platforms, such as Reddit and Twitter datasets.

The document is structured as follows: Section 2 provides an overview of previous studies on suicide detection on social media. Section 3 describes the methodology, including the proposed ensemble stacked classifier technique and specific approaches used in the model. Section 4 focuses on experimental results, covering data and categorization analysis, along with discussions and study limitations. Lastly, Section 5 concludes the report, outlining limitations and offering recommendations for future research in this field.

2 Related work

Researchers have taken notice of the alarming increase in suicide rates and have introduced various practices, including questionnaire surveys and automated post-detection methods, to identify suicide attempts. Previously, detection relied solely on clinical methods such as face-to-face interviews and questionnaires, emphasizing the importance of understanding the psychology and behaviour associated with suicidal ideation through direct interactions [10, 28]. Surveys and interviews with individuals who have attempted suicide have been conducted on platforms such as Weibo and other social media sites [29]. In a notable study, Katchapakirin et al. conducted an experiment on Facebook in 2018 to detect depression within the Thai community [1]. J. Gao et al. developed a model using SVM, Random Forest, and LSTM to detect suicide risk by manually annotating comment datasets on YouTube [2, 19]. Their LSTM model achieved an accuracy of 84.5%. Valeriano et al. focused on detecting suicidal ideation in Spanish-language posts and used the Twitter API to extract the dataset [3]. After human annotation and pre-processing to remove unwanted material, they converted the dataset into vectors using TF-IDF and Word2Vec. They then applied SVM and Logistic Regression, achieving accuracy rates of 74% and 79%, respectively. These studies demonstrate the potential of leveraging social media data and machine learning algorithms to address the complex issue of suicide detection. A limitation of these studies lies in dataset bias; results may improve with a more balanced distribution of data.

Studies from 2018 to 2023 have explored suicidal ideation detection using a range of methodologies [11, 15]. In 2018, Aladag et al. applied RF, LR, and SVM to Reddit data, noting reliance on a single dataset as a limitation [4]. Coppersmith et al. combined datasets to predict behavior changes and suicide risks, using a Bi-directional LSTM with Self-Attention to reach 94% accuracy, though the study was limited by its focus on females aged 18 to 24 [5]. Sawhney et al. explored deep learning with RNN, LSTM, and CNN-LSTM, achieving varying accuracies for suicidal ideation detection [6]. Shing et al. introduced CNN for this purpose, while S. Ji et al. provided a comprehensive review of diverse methods for predicting suicidal ideation, including clinical practices and deep learning techniques [7–9]. In subsequent years, Tadesse et al. achieved a superior result of 93% with the LSTM-CNN model [13], Ning et al. proposed a deep learning method with C attention for an 84.3% accuracy [14], and Zepeng et al. used FastText and TextCNN models with accuracies of 84.87% and 87.15%, respectively [16]. Akshma et al. achieved a high recall of 94.94% by applying attention over CNN and LSTM to a Reddit dataset [38], while Bhavini et al. used a Twitter dataset and a stacked CNN-LSTM model for a 93.92% accuracy [37]. Liu et al. introduced an ensemble model emphasizing feature combination and proposed integrating models such as BERT for enhanced multi-classification [17]. Meanwhile, Li Z and Zhou implemented an approach involving balanced subset training of base classifiers for improved model refinement [16]. In 2023, Ghosal et al. developed a framework for discerning depression-related content and suicidal risk using FastText embeddings, TF-IDF vectorization, and XGBoost; on a Reddit dataset it outperformed baseline models with a 0.78 AUC and a 0.71 weighted F-score [33].

In prior studies, researchers mainly focused on monitoring social media posts to identify changes in people's behavior and signs of mental health issues. Although past research has made progress with machine learning and deep learning techniques, limitations remain, such as reliance on single datasets, biases, overfitting concerns, and the need for better feature extraction to enhance accuracy and minimize classification errors. Furthermore, there is a need for a model that is both interpretable and robust, reducing computational complexity and improving the handling of noisy data, both common challenges in text classification tasks. To overcome these limitations, the study used two diverse datasets, Reddit and Twitter, each with a nearly balanced distribution. It applied hyperparameter tuning through grid search optimization, addressed overfitting via K-fold cross-validation, and enhanced feature extraction using pre-trained models such as BERT, which is further compared with MUSE and LLM GPT. The proposed ensemble Stacked Classifier combines Decision Tree, Random Forest, and Gradient Boost as base classifiers with XGBoost as the meta-classifier; it requires less computation time and performs well on noisy data. These findings provide a potential solution to the limitations of previous work.

3 Methodology

The proposed method for detecting changes in behaviour or signs of suicidal thoughts consists of several steps: data gathering, data pre-processing, feature engineering, and prediction of suicidal or non-suicidal posts. The following subsections and Fig. 1 provide a concise overview of each step.

Fig. 1 Workflow.

3.1 Dataset

The Reddit data for this study was collected from Kaggle [39] and consists of posts used to distinguish between suicidal and non-suicidal content. The posts were extracted from two subreddits, "Suicide Watch" (Dec 16, 2008, to Jan 2, 2021) and "Depression" (Jan 1, 2009, to Jan 2, 2021), using the Pushshift API. A total of 38,016 posts were gathered. The training dataset constitutes 80% of the total, encompassing 30,412 posts, with 15,164 labeled as related to suicide and 15,248 as non-suicide-related. The remaining 20%, designated as the testing data, is treated as unseen, real-world data that is input directly into the trained model. The Twitter dataset, sourced from GitHub [40], was curated for identifying suicidal behavior on social media and comprised approximately 9,118 tweets. This dataset was divided into training and testing datasets at an 80:20 ratio using a fixed random state. Within the training Twitter dataset, there are 7,295 tweets, of which 3,913 were labeled as indicative of suicidal ideation, while the remaining 4,102 were classified as non-suicidal. Both the Reddit and Twitter datasets showed a balanced distribution, with roughly equal numbers of suicidal and non-suicidal posts.
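The partitioning described above can be sketched with the standard library alone. This is a minimal illustration, not the study's code: the function names, the seed value 42, and k = 5 are assumptions (the text does not state the random state or the number of folds). The test-set size is rounded up, which is one convention that reproduces the 30,412 training posts reported for the Reddit dataset.

```python
import math
import random


def train_test_split_indices(n_samples, test_ratio=0.2, seed=42):
    """Shuffle indices with a fixed seed and split them 80:20.

    The seed is a hypothetical stand-in for the study's random state.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_ratio)  # round the test share up
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for each of the k folds."""
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for held_out in range(k):
        validation = folds[held_out]
        training = [idx
                    for fold_id, fold in enumerate(folds)
                    if fold_id != held_out
                    for idx in fold]
        yield training, validation


train_idx, test_idx = train_test_split_indices(38016)
print(len(train_idx), len(test_idx))
for fold_no, (tr, va) in enumerate(k_fold_indices(train_idx, k=5), start=1):
    print(f"fold {fold_no}: {len(tr)} training / {len(va)} validation")
```

Each training index is held out exactly once across the k folds, so every post contributes to both training and validation.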

3.2 Data preprocessing

Text pre-processing is a crucial step in preparing the dataset for the model. It involves several phases: text tokenization, removal of stop words, elimination of URLs, punctuation removal, lowercase conversion, lemmatization, and removal of non-English words [17, 24]. Applying these pre-processing steps [9, 23] transforms the raw text into a clean, consistent form. The quality of the data plays a vital role in determining the performance of the model; inconsistent data can significantly degrade it, which underscores the importance of proper data pre-processing or cleaning.
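A few of these phases can be sketched as follows. This is an illustrative sketch only: the function name `preprocess`, the URL pattern, and the tiny stop-word set are assumptions, and lemmatization and non-English word removal are omitted because they need external language resources.

```python
import re
import string

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "am", "are", "i", "to", "and", "of", "in", "it"}


def preprocess(text):
    """Remove URLs, lowercase, strip punctuation, tokenize, drop stop words."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)                # URLs
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = text.split()                                             # whitespace tokens
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(preprocess("I am reading THIS: https://example.com/page, it is great!"))
# ['reading', 'this', 'great']
```

The URL is removed before punctuation so that its fragments do not survive as spurious tokens.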

3.3 Feature extraction

Feature extraction converts raw data into a numerical format while retaining essential information, which is crucial for machine learning and deep learning tasks. Count Vectorizer is commonly used to transform text documents into numerical vectors, resulting in a sparse matrix representation. TF-IDF Vectorizer evaluates word importance by assigning higher weights to terms that are frequent within a document but rare across the entire corpus. N-grams, such as bi-grams (n=2) and tri-grams (n=3), capture sequences of adjacent words. While increasing the value of n expands feature variety, optimal results were obtained with n=2, generating 717,046 attributes compared to 234,558 with n=3.
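The following sketch shows how n-gram counts can be turned into TF-IDF weights. It uses the textbook formula tf × log(N / df) and is only an illustration under that assumption; library vectorizers typically apply smoothing and normalization, so their values differ. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (joined by spaces) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n=2):
    """Compute unsmoothed TF-IDF weights for the n-grams of each document."""
    counts = [Counter(ngrams(doc.lower().split(), n)) for doc in docs]
    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter(gram for c in counts for gram in c)
    n_docs = len(docs)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({gram: (cnt / total) * math.log(n_docs / doc_freq[gram])
                        for gram, cnt in c.items()})
    return weights


docs = ["we walk home today", "we walk to school", "home today again"]
for w in tfidf(docs, n=2):
    print(w)
```

In the first document, the bi-gram "walk home" occurs in one document only and therefore receives a higher weight than "we walk", which occurs in two.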

BERT is a language representation model pre-trained on a large collection of unlabeled text using several pre-training tasks. It was introduced in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Devlin et al. in 2018. BERT's notable characteristic is its strong bidirectionality [12]: during pre-training it builds bidirectional textual representations that integrate context from both preceding and following segments [41, 43]. The study used the "bert-base-cased" model, a pretrained model for English text that preserves case. This model comprises 12 transformer encoder modules, which process input data hierarchically. It has a hidden size of 768, so each input token is transformed into a 768-dimensional vector during computation. With 12 attention heads, the model captures various aspects of the information in the input text. In total, the model has 110 million trainable parameters. BERT models are valued for their ability to condense text into compact vectors that capture its semantic meaning. The model accommodates sequences of up to 512 tokens. Specifically, the BERT extraction process generated 30,412 sequences of 512 attributes each for the Reddit dataset and 7,295 sequences of 512 attributes each for the Twitter dataset.

Multilingual Unsupervised and Supervised Embeddings (MUSE) is a natural language processing approach designed to address the challenges of cross-lingual word embeddings [42, 44]. Its primary objective is to create a shared vector space in which words from many languages are represented, which simplifies language-related tasks and translation. MUSE combines unsupervised and supervised learning to align words and phrases across languages, forming meaningful semantic connections even when direct equivalents are absent. This improves cross-lingual comprehension and benefits tasks such as machine translation, cross-lingual information retrieval, and sentiment analysis. The Multilingual Universal Sentence Encoder (MUSE) is a pretrained neural network designed for converting sentences into meaningful vectors. Its model structure comprises 6 layers, with a hidden layer of 512 units and 12 heads for multi-headed attention, which contributes to its ability to understand diverse sentence contexts. It also uses a dropout rate of 0.1 to enhance robustness. The model has approximately 50 million parameters and accommodates sequences of up to 512 tokens. Applying the MUSE approach, the study derived 30,412 sequences of 512 attributes each from the Reddit dataset and 7,295 sequences with the same attribute length from the Twitter dataset.

The Large Language Model (LLM) is a milestone in artificial intelligence and natural language processing [45]. Built on the Generative Pre-trained Transformer architecture, it can comprehend context, generate coherent text, and provide insights across a wide array of subjects, which makes it useful for applications ranging from content synthesis and summarization to language translation and contextual comprehension. The LLM used here is a pre-trained GPT-2 model for the English language. Its training objective is causal language modeling (CLM), that is, predicting which words come next in a sentence. The model is characterized by 768-dimensional embeddings, a decoder-only Transformer architecture with 12 layers, 12 attention heads for capturing different aspects of context, and GELU activation for nonlinear transformations. This compact GPT-2 variant comprises 124 million parameters and can handle sequences of up to 512 tokens. When applied to the Reddit dataset, it extracts 30,412 segments of 512 tokens each, and for the Twitter dataset, it extracts 7,295 segments with the same token length.

3.4 Classification model

The Decision Tree is a well-known machine learning algorithm used for classification and regression [3]. It takes the form of a tree-like model, with internal nodes representing features or attributes, branches representing decision rules based on those features, and leaf nodes denoting the final outcomes or predictions. Data classification is carried out using the entropy criterion shown in Eq. (1), where $p_i$ denotes the probability of occurrence of class $i$ and $c$ is the number of classes.

$E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$ (1)

Random Forest is a well-known ensemble learning algorithm widely used in machine learning [2]. It leverages the collective power of multiple decision trees to create a robust and accurate predictive model. For data classification, Random Forest uses the Gini Index and Entropy, shown in Eqs. (2) and (3), where $K_n$ denotes the probability of class $n$ and the sum runs over the classes. These measures determine the quality of splits during the construction of the decision trees within the Random Forest.

$\text{Gini impurity} = \sum_{n=1}^{i} K_n (1 - K_n)$ (2)

$\text{Entropy} = -\sum_{n=1}^{i} K_n \log K_n$ (3)
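As a worked illustration of Eqs. (1) to (3), the sketch below computes entropy and Gini impurity from the class labels at a tree node. The function names are hypothetical, and the base-2 logarithm follows Eq. (1); Eq. (3) does not state a base, so using base 2 there is an assumption.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Return the relative frequency of each class among the labels."""
    counts = Counter(labels)
    return [count / len(labels) for count in counts.values()]


def entropy(labels):
    """Entropy of a node: minus the sum of p * log2(p) over the classes."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


def gini(labels):
    """Gini impurity of a node: the sum of p * (1 - p) over the classes."""
    return sum(p * (1 - p) for p in class_probabilities(labels))


balanced = ["suicidal", "non-suicidal", "suicidal", "non-suicidal"]
pure = ["suicidal"] * 4
print(entropy(balanced), gini(balanced))  # 1.0 0.5
print(abs(entropy(pure)), gini(pure))     # 0.0 0.0
```

A perfectly balanced two-class node gives the maximum values (entropy 1, Gini 0.5), while a pure node gives 0 for both, which is why a split that lowers these measures is preferred.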

Gradient Boosting is a popular machine learning algorithm that combines multiple weak predictive models (often decision trees) to create a strong predictive model [9]. It is based on the principle of iteratively improving the model's performance by minimizing a loss function, which gives it the capability to reduce bias error.

$H_c(b) = H_{c-1}(b) + p_c q_c(b)$ (4)

Through Eq. (4), Gradient Boosting constructs an approximation $\hat{H}(b)$ of the underlying function $H^*(b)$, which maps examples $b$ to their respective output values $z$. The approximation is expressed as a combination of weighted functions, where $p_c$ denotes the weight of the $c$-th function $q_c(b)$.
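The additive update in Eq. (4) can be illustrated with a small sketch. It assumes a squared-error loss (so each weak learner fits the current residuals), one-dimensional inputs, decision stumps as the weak learners $q_c$, and a constant weight $p_c = 0.5$; none of these choices is taken from the study, and the function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold stump that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right) if right else left_value
        error = sum((r - (left_value if x <= threshold else right_value)) ** 2
                    for x, r in zip(xs, residuals))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def boost(xs, ys, rounds=20, weight=0.5):
    """Apply H_c(b) = H_{c-1}(b) + p_c * q_c(b) for a fixed number of rounds."""
    predictions = [sum(ys) / len(ys)] * len(xs)  # H_0: the mean target
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        predictions = [p + weight * (left_value if x <= threshold else right_value)
                       for x, p in zip(xs, predictions)]
    return predictions


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
print([round(p, 3) for p in boost(xs, ys)])
```

Each round adds a weighted correction to the previous model, so the predictions move steadily from the mean of the targets toward the targets themselves.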

Extreme Gradient Boosting (XGBoost) is a powerful machine learning algorithm known for its performance and flexibility in solving regression and classification problems [13]. It is an advanced implementation of Gradient Boosting that incorporates additional enhancements to improve model accuracy and efficiency. By combining boosting with parallel processing, XGBoost iteratively builds an ensemble of weak prediction models, typically decision trees, to form a robust and accurate final prediction model. XGBoost is thus an ensemble of multiple classification and regression trees (CART).

$\hat{A} = \sum_{p=1}^{P} q_p(ml), \quad q_p \in Q$ (5)

In Eq. (5), $P$ represents the total number of trees, and $q_p$ is the function associated with the $p$-th tree, which belongs to the functional space $Q$. Here, $Q$ is the set of all possible Classification and Regression Trees (CARTs).

The Ensemble Voting Classifier is a machine-learning model that aggregates predictions from multiple individual classifiers to generate a final prediction. Each classifier within the ensemble is trained independently on the same dataset using different algorithms or configurations. The voting classifier then combines the predictions of its constituent classifiers, selecting the final prediction based on either a majority or weighted vote. In a majority vote, the class with the highest number of votes from the classifiers is chosen as the final prediction, while in a weighted vote, each classifier’s vote is weighted based on its performance or confidence level. In this approach, the Ensemble Voting classifier combines predictions from Decision Tree, Random Forest, Gradient Boosting, and XGBoost classifiers. Each classifier predicts a class label, and the final prediction is determined by the majority class among all the classifiers.
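The two voting rules described above can be sketched as follows. The function names are hypothetical, and the tie-breaking behaviour (the label seen first wins) is an artefact of this sketch rather than something specified in the text.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight across classifiers."""
    scores = Counter()
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# One prediction each from, e.g., Decision Tree, Random Forest,
# Gradient Boosting, and XGBoost for a single post.
votes = ["suicidal", "non-suicidal", "suicidal", "suicidal"]
print(majority_vote(votes))                            # suicidal
print(weighted_vote(["a", "b", "b"], [0.9, 0.3, 0.3]))  # a
```

The second call shows how a weighted vote can overturn a simple majority when one classifier carries much more weight than the others.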

The proposed Stacked Classifier combines predictions from several classifiers using a meta-classifier for the final prediction. It operates in two levels: first, the individual base classifiers are trained independently and their predictions are collected; then, these predictions, along with the original features, are used to train the meta-classifier, as shown in Fig. 2. The Stacking Classifier gains robustness against overfitting by training all base models on the same dataset, which reduces the risk of fitting noise. It capitalizes on the strengths of each base model, reducing bias and variance for better generalization, and it adapts to diverse feature spaces, which allows tailored model selection. In this work, Decision Tree, Random Forest, and Gradient Boosting serve as base classifiers and XGBoost serves as the meta-classifier, as illustrated in Algorithm 1. This approach combines the strengths of the individual classifiers to improve prediction accuracy and overall model performance.

Fig. 2 Proposed Stacked Classifier.
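A minimal sketch of this two-level design, assuming scikit-learn is available, is given below. It is not the study's implementation: synthetic data replaces the text embeddings, scikit-learn's `GradientBoostingClassifier` stands in for XGBoost as the meta-classifier to avoid an extra dependency, and the tree counts are reduced for speed. Setting `passthrough=True` passes the original features to the meta-classifier together with the base predictions; note that scikit-learn trains the meta-classifier on cross-validated base predictions, which may differ from the procedure used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for embedding vectors and binary labels (assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

base_classifiers = [
    ("dt", DecisionTreeClassifier(criterion="entropy", max_depth=10,
                                  random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=30, criterion="entropy",
                                  max_depth=5, random_state=0)),
    ("gb", GradientBoostingClassifier(n_estimators=30, max_depth=3,
                                      random_state=0)),
]

stack = StackingClassifier(
    estimators=base_classifiers,
    # Stand-in for the XGBoost meta-classifier (assumption).
    final_estimator=GradientBoostingClassifier(n_estimators=30, random_state=0),
    passthrough=True,  # meta-classifier also sees the original features
    cv=5,              # base predictions for the meta level come from 5-fold CV
)

# 5-fold cross-validated accuracy of the whole stack (k = 5 is an assumption).
scores = cross_val_score(stack, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

Swapping in an XGBoost classifier as `final_estimator` would bring the sketch closer to the configuration described in the text.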

Algorithm 1 Proposed Stacked Classifier
procedure EMBEDDING(pre-processed text, pre-trained language model)
    Initialize the pre-trained language model (e.g., BERT, MUSE, LLM)
    Tokenize the pre-processed text into tokens or subword units (if necessary)
    Initialize an empty list to store token embeddings
    for each token in the tokenized text do
        Encode the token using the pre-trained language model
        Extract the token's embedding
        Append the token's embedding to the list
    end for
    Combine token embeddings to obtain a text-level representation (e.g., pooling, averaging)
end procedure
procedure DIVIDEDATA
    [Training + Testing] ← Embedding, Label
    return x_train, x_test, y_train, y_test
end procedure
procedure MODEL1
    Add Decision Tree as base classifier: $E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$
    Add Random Forest as base classifier: $\text{Entropy} = -\sum_{n=1}^{i} K_n \log K_n$
    Add Gradient Boost as base classifier: $H_c(b) = H_{c-1}(b) + p_c q_c(b)$
    Add XGBoost as meta-classifier: $\hat{A} = \sum_{p=1}^{P} q_p(ml), \quad q_p \in Q$
    model1 ← model1.compile
    for <number of estimators, learning rate, sample leaf, max depth = finest parameter (Grid Search)> do
        Pass the train and test data through the model and generate accuracy
    end for
end procedure
Save model1.h5
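The final step of the EMBEDDING procedure, combining token embeddings into a text-level representation, can be illustrated with simple averaging. Mean pooling is only one of the options the algorithm names, and the text does not state which one the study used; the function name and the toy vectors are hypothetical.

```python
def mean_pool(token_embeddings):
    """Average a list of equal-length token vectors into one text vector."""
    if not token_embeddings:
        raise ValueError("at least one token embedding is required")
    dim = len(token_embeddings[0])
    n_tokens = len(token_embeddings)
    return [sum(vec[d] for vec in token_embeddings) / n_tokens
            for d in range(dim)]


# Two toy 3-dimensional token vectors; real embeddings are far larger.
print(mean_pool([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]))  # [2.0, 3.0, 4.0]
```

The pooled vector has the same dimensionality as each token vector, so every text maps to a fixed-length feature vector regardless of its length.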

3.5 Optimization of model

Grid optimization, also known as grid search, is a technique for finding the optimal hyperparameters of a machine learning or deep learning model by exhaustively searching through a predefined grid of hyperparameter values. Grid optimization was applied to the models used in the Stacked Classifier, in which Decision Tree, Random Forest, and Gradient Boosting serve as base classifiers and XGBoost serves as the meta-classifier. For each classifier, a grid of possible values for the respective hyperparameters is defined. The grid search algorithm then evaluates the performance of the model for each combination of hyperparameters in the grid. The finest parameters chosen are given in Table 1.

Table 1
Parameters tuned with grid search

Model	Grid parameter	Finest parameter
Decision Tree	Criterion: Gini, Entropy	Entropy
Decision Tree	Max_depth: None, 5, 10, 15	10
Decision Tree	Min_samples_split: 2, 5, 20	5
Decision Tree	Min_samples_leaf: 2, 5, 10	5
Random Forest	n_estimators: 100, 150, 200, 300	150
Random Forest	Criterion: Gini, Entropy	Entropy
Random Forest	Max_depth: None, 5, 10, 15	5
Random Forest	Min_samples_split: 2, 5	5
Random Forest	Min_samples_leaf: 1, 2	2
Gradient Boost	Learning_rate: 0.1, 0.001, 0.001	0.1
Gradient Boost	n_estimators: 200, 400, 800, 900	800
Gradient Boost	Max_depth: 3, 5, 7	5
Gradient Boost	Subsample: 0.8, 1.0	1
Gradient Boost	Min_samples_split: 2, 5	2
XGBoost	Learning_rate: 0.1, 0.001, 0.001	0.1
XGBoost	n_estimators: 100, 200, 300, 400, 500	500
XGBoost	Max_depth: 3, 5, 7	5
XGBoost	Subsample: 0.8, 1.0	1
XGBoost	Colsample_bytree: 0.8, 1.0	1
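The exhaustive enumeration behind grid search can be sketched with the Decision Tree grid from Table 1. The parameter names follow scikit-learn's spelling, which is an assumption about the table's labels, and `toy_score` is a hypothetical stand-in for the cross-validated score that a real search would compute for each combination.

```python
from itertools import product

# Decision Tree grid from Table 1 (scikit-learn style names are an assumption).
dt_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 5, 10, 15],
    "min_samples_split": [2, 5, 20],
    "min_samples_leaf": [2, 5, 10],
}


def iter_grid(grid):
    """Yield every combination of hyperparameter values as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, score_fn):
    """Return the combination with the highest score."""
    return max(iter_grid(grid), key=score_fn)


# Hypothetical score that peaks at the values reported in Table 1; a real
# search would train and validate a model for each combination instead.
reported = {"criterion": "entropy", "max_depth": 10,
            "min_samples_split": 5, "min_samples_leaf": 5}


def toy_score(params):
    return sum(params[name] == value for name, value in reported.items())


print(len(list(iter_grid(dt_grid))))    # 72 combinations
print(grid_search(dt_grid, toy_score))
```

The grid size grows multiplicatively with each added hyperparameter (here 2 × 4 × 3 × 3 = 72), which is why the grids are kept small.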

4 Experimental result

The study evaluates the proposed method's efficacy in identifying suicide-related posts using metrics derived from the counts of true positives, true negatives, false positives, and false negatives. Precision is the ratio of true positives to all positive predictions, while Recall is the ratio of true positives to all actual positives. The F1-score, the harmonic mean of Precision and Recall, offers a combined evaluation. A macro-average approach assigns equal importance to both classes, i.e., suicidal and non-suicidal posts: Precision_macro, Recall_macro, and F1-Score_macro are calculated by averaging the per-class values, providing a balanced assessment of the model's performance across categories. The overall evaluation therefore includes Precision, Recall, F1-score, and Accuracy.
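The macro-averaged metrics can be computed as in the sketch below. The function name and the toy labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def macro_scores(y_true, y_pred, classes):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    precisions, recalls, f1_scores = [], [], []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    n_classes = len(classes)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision_macro": sum(precisions) / n_classes,
        "recall_macro": sum(recalls) / n_classes,
        "f1_macro": sum(f1_scores) / n_classes,
    }


# 1 = suicidal, 0 = non-suicidal (toy labels)
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
print(macro_scores(y_true, y_pred, classes=[0, 1]))
```

In this toy case the suicidal class scores 0.75 and the non-suicidal class 0.5 on each metric, so every macro average is 0.625 even though accuracy is about 0.67, which shows how macro-averaging weights the smaller class equally.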

4.1 Experiment to examine the Stacked Classifier against the Voting Classifier and other machine learning models

The main objective is to identify alterations in behaviour indicative of suicidal tendencies by analyzing social media posts. This section discusses the use of natural language processing (NLP) for predicting suicidal thoughts and evaluates the performance of different machine learning techniques. The pre-processed Twitter and Reddit datasets are used for the subsequent steps. The extracted feature vectors are then fed into several machine learning models: Decision Tree (DT), Random Forest with n-estimators = 120 (RF), Bernoulli Naïve Bayes (NB), Gradient Boosting with n-estimators = 800 (GB), XGBoost with n-estimators = 500 (XGB), the Ensemble Voting Classifier (a combination of DT, RF, GB, and XGB), and the proposed Stacking Classifier (base classifiers: DT, RF, and GB; meta-classifier: XGB).

First, the study establishes a pipeline that integrates Count Vectorizer with N-gram analysis and transforms the result into a TF-IDF representation. These features are fed to Decision Tree, Random Forest, Naïve Bayes, Gradient Boosting, XGBoost, the Ensemble Voting Classifier, and the Stacking Classifier, with the resulting metrics shown in Table 2. The Ensemble Voting Classifier reaches 91% accuracy on both datasets, and the Stacking Classifier reaches 93% on the Twitter dataset and 94% on the Reddit dataset. The Stacking Classifier therefore achieves the highest accuracy on both datasets. On the Reddit dataset it also requires less computation time than the Ensemble Voting Classifier (3180.463 s versus 5979.4422 s), although on the smaller Twitter dataset it is somewhat slower (677.438 s versus 572.974 s).

Table 2
Count Vectorizer + Ngram + TFIDF Feature Extraction

Model	Dataset	Accuracy	F-Score	Recall	Precision	Computation (s)
Decision Tree	Twitter	0.87	0.87	0.87	0.87	6.103
Decision Tree	Reddit	0.84	0.84	0.84	0.84	69.33
Random Forest	Twitter	0.89	0.89	0.88	0.91	79.012
Random Forest	Reddit	0.87	0.87	0.87	0.87	676.962
Bernoulli Naïve Bayes	Twitter	0.78	0.75	0.75	0.86	20.102
Bernoulli Naïve Bayes	Reddit	0.76	0.76	0.76	0.8	242.1234
Gradient Boost	Twitter	0.9	0.9	0.89	0.91	886.55
Gradient Boost	Reddit	0.89	0.89	0.89	0.89	4222.539
XGBoost	Twitter	0.91	0.91	0.91	0.92	32.699
XGBoost	Reddit	0.9	0.9	0.9	0.91	161.221
Ensemble Voting	Twitter	0.91	0.91	0.9	0.91	572.974
Ensemble Voting	Reddit	0.91	0.91	0.91	0.92	5979.4422
Stacking Classifier	Twitter	0.93	0.93	0.92	0.93	677.438
Stacking Classifier	Reddit	0.94	0.94	0.93	0.94	3180.463

Feature extraction for all rows: Count Vectorizer + N-gram + TFIDF. Twitter dataset entity = 7295 × 160187; Reddit dataset entity = 30412 × 717046.

A comprehensive evaluation of contemporary embedding techniques was conducted alongside a diverse set of machine learning models. The advanced embedding techniques BERT, MUSE, and LLM (GPT) were employed to enhance word representations, and the resulting embeddings were fed into the Decision Tree, Random Forest (n-estimators = 120), Bernoulli Naïve Bayes, Gradient Boosting (n-estimators = 800), XGBoost (n-estimators = 500), Ensemble Voting Classifier, and Stacking Classifier models. Table 3 compares the performance of these embeddings in combination with each machine learning algorithm on the same two datasets, Twitter and Reddit. Among the evaluated models, the Stacking Classifier produced the best results. Combined with BERT embedding, it achieved accuracies of 96% and 97% on the Twitter and Reddit datasets, respectively. MUSE embedding yielded accuracies of 95% and 96%, and LLM (GPT) embedding 94% and 95%. Figures 3, 4, and 5 show the ROC curves of the machine learning models with BERT, MUSE, and LLM embeddings, respectively.

Table 3

Embedding BERT, MUSE and LLM with Machine Learning Method

Model	Embedding	Dataset	Accuracy	F-Score	Recall	Precision	Computation (s)
Decision Tree	BERT	Twitter	0.83	0.83	0.82	0.84	12.945
		Reddit	0.84	0.84	0.83	0.85	36.123
	MUSE	Twitter	0.84	0.84	0.84	0.84	10.186
		Reddit	0.85	0.85	0.84	0.86	32.145
	LLM (GPT)	Twitter	0.83	0.83	0.82	0.84	10.391
		Reddit	0.84	0.84	0.83	0.84	33.74
Random Forest	BERT	Twitter	0.89	0.89	0.89	0.89	10.325
		Reddit	0.89	0.89	0.88	0.90	38.421
	MUSE	Twitter	0.90	0.90	0.89	0.90	8.497
		Reddit	0.90	0.90	0.90	0.91	36.124
	LLM (GPT)	Twitter	0.89	0.89	0.88	0.90	12.144
		Reddit	0.90	0.90	0.90	0.90	37.487
Bernoulli Naïve Bayes	BERT	Twitter	0.82	0.82	0.81	0.83	25.327
		Reddit	0.83	0.83	0.82	0.83	68.456
	MUSE	Twitter	0.88	0.88	0.87	0.89	20.383
		Reddit	0.87	0.87	0.88	0.88	67.252
	LLM (GPT)	Twitter	0.85	0.85	0.84	0.85	7.032
		Reddit	0.85	0.85	0.84	0.85	66.259
Gradient Boost	BERT	Twitter	0.89	0.89	0.89	0.90	514.112
		Reddit	0.90	0.90	0.90	0.91	1553.242
	MUSE	Twitter	0.90	0.90	0.90	0.91	344.501
		Reddit	0.91	0.91	0.91	0.91	1245.854
	LLM (GPT)	Twitter	0.90	0.90	0.89	0.91	525.465
		Reddit	0.91	0.91	0.90	0.91	1356.241
XGBoost	BERT	Twitter	0.91	0.91	0.90	0.92	163.504
		Reddit	0.91	0.91	0.90	0.91	624.251
	MUSE	Twitter	0.91	0.91	0.91	0.92	107.645
		Reddit	0.92	0.92	0.91	0.92	633.245
	LLM (GPT)	Twitter	0.91	0.91	0.90	0.92	118.74
		Reddit	0.92	0.92	0.91	0.92	636.241
Ensemble Voting	BERT	Twitter	0.91	0.91	0.90	0.92	492.21
		Reddit	0.91	0.91	0.90	0.91	1201.524
	MUSE	Twitter	0.91	0.91	0.91	0.92	313.002
		Reddit	0.92	0.92	0.91	0.92	1137.89
	LLM (GPT)	Twitter	0.90	0.90	0.90	0.90	460.049
		Reddit	0.92	0.92	0.91	0.92	1287.541
Stacking Classifier	BERT	Twitter	0.96	0.96	0.95	0.97	312.412
		Reddit	0.97	0.97	0.96	0.97	821.542
	MUSE	Twitter	0.95	0.95	0.95	0.95	280.124
		Reddit	0.96	0.96	0.95	0.96	789.325
	LLM (GPT)	Twitter	0.94	0.94	0.93	0.95	297.467
		Reddit	0.95	0.95	0.94	0.95	799.251
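The accuracy, F-score, recall and precision columns used in these tables can be derived from confusion counts, as in the following sketch. It assumes a single positive class (label 1, the suicidal-ideation class); the study does not state whether its figures are computed this way or averaged over both classes, and the labels in the example are invented.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("at least one prediction is required")
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Invented labels: 4 positive posts, 6 negative posts.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this example two of the four positive posts are missed, so recall (0.5) falls below precision (2/3) even though accuracy is 0.7, which is why the tables report all four measures rather than accuracy alone.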

Fig. 3

Comparison of ROC curves with BERT Embedding.

Fig. 4

Comparison of ROC curves with MUSE Embedding.

Fig. 5

Comparison of ROC curves with LLM Embedding.

4.2 Experiment to examine the stacked classifier against advanced deep learning models

Furthermore, the study was expanded to include the advanced deep learning models Bi-LSTM, Bi-GRU, and Text-CNN, coupled with the embedding techniques BERT, MUSE, and LLM (GPT). The comparative results are presented in Table 4. According to Table 4, the Stacking Classifier requires less computation time than the deep learning models (for example, 312 s versus 3699 to 3799 s with BERT on the Twitter dataset) and also attains the highest accuracy across all embeddings and both datasets.

The grid search optimization technique was used to tune the hyperparameters of the three deep learning architectures: Bidirectional Long Short-Term Memory (Bi-LSTM), Bidirectional Gated Recurrent Unit (Bi-GRU), and Text Convolutional Neural Network (Text-CNN), as shown in Table 5. Each architecture underwent a systematic exploration of hyperparameter configurations to find the most effective settings. For the Bi-LSTM model, epochs from 10 to 40 and learning rates from 0.001 to 0.1 were considered, together with several batch sizes, unit counts (64 selected for the LSTM layer), optimizers (Adam, RMSprop, SGD), and dropout rates (0.18 selected). The Bi-GRU architecture underwent the same search, covering epochs from 10 to 40, learning rates from 0.001 to 0.1, and several batch sizes, unit counts (64 selected for the GRU layer), optimizers (Adam selected), and dropout rates (0.18 selected). The Text-CNN search covered epochs from 10 to 40, learning rates from 0.001 to 0.1, several batch sizes, unit counts (64 selected for the convolutional and fully connected layers), dropout rates (0.18 selected), optimizers (Adam selected), kernel sizes of 2, 3, and 5, and Conv1D filter counts (128 selected).

Table 4
Embedding BERT, MUSE and LLM with Deep Learning Method

Model	Embedding	Dataset	Accuracy	F-Score	Recall	Precision	Computation (s)
Bi-LSTM	BERT	Twitter	0.91	0.91	0.90	0.92	3799.12
		Reddit	0.92	0.92	0.91	0.92	4587.12
	MUSE	Twitter	0.90	0.90	0.89	0.90	3829.124
		Reddit	0.90	0.90	0.89	0.90	4752.74
	LLM (GPT)	Twitter	0.89	0.89	0.88	0.90	3800.54
		Reddit	0.90	0.90	0.89	0.91	4668.42
Bi-GRU	BERT	Twitter	0.91	0.91	0.90	0.92	3698.58
		Reddit	0.92	0.92	0.91	0.92	4375.12
	MUSE	Twitter	0.92	0.92	0.91	0.92	3612.24
		Reddit	0.92	0.92	0.91	0.92	4784.12
	LLM (GPT)	Twitter	0.91	0.91	0.90	0.92	3514.27
		Reddit	0.91	0.91	0.91	0.91	4534.85
Text-CNN	BERT	Twitter	0.92	0.92	0.91	0.92	3745.47
		Reddit	0.93	0.93	0.92	0.93	4454.21
	MUSE	Twitter	0.91	0.91	0.90	0.92	3712.4
		Reddit	0.91	0.91	0.90	0.92	4695.481
	LLM (GPT)	Twitter	0.92	0.92	0.91	0.92	3754.548
		Reddit	0.93	0.93	0.92	0.93	4725.51
Stacking Classifier	BERT	Twitter	0.96	0.96	0.95	0.96	312.412
		Reddit	0.97	0.97	0.96	0.97	821.542
	MUSE	Twitter	0.95	0.95	0.94	0.95	280.124
		Reddit	0.96	0.96	0.95	0.96	789.325
	LLM (GPT)	Twitter	0.94	0.94	0.93	0.94	297.467
		Reddit	0.95	0.95	0.94	0.95	799.251

Table 5

Parameter tuning of deep learning techniques

Model	Grid parameter	Best parameter
Bi-LSTM	Learning rate: 0.001,0.01,0.1	0.01
	Unit: 16,32,64,128	64
	Optimizer: Adam, RMSprop, SGD	Adam
	Batch size: 5,10,16,32	16
	Epochs: 10,20,30,40	30
	Drop out: 0.10,0.16,0.18,0.2	0.18
Bi-GRU	Learning rate: 0.001,0.01,0.1	0.01
	Unit: 16,32,64,128	64
	Optimizer: Adam, RMSprop, SGD	Adam
	Batch size: 5,10,16,32	16
	Epochs: 10,20,30,40	20
	Drop out: 0.10,0.16,0.18,0.2	0.18
Text-CNN	Learning rate: 0.001,0.01,0.1	0.01
	Unit: 16,32,64,128	64
	Optimizer: Adam, RMSprop, SGD	Adam
	Batch size: 5,10,16,32	32
	Epochs: 10,20,30,40	20
	Conv1D: 32,64,128,256	128
	Kernel: 2,3,5	5
	Drop out: 0.10,0.16,0.18,0.2	0.18
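The exhaustive search summarized in Table 5 can be sketched as an enumeration over the Cartesian product of the listed values. The grid below copies the Bi-LSTM rows of the table; `toy_score` is a hypothetical stand-in that only rewards agreement with the settings the table reports as selected, whereas a real run would train the network for each configuration and return its validation score.

```python
from itertools import product

# Search space for Bi-LSTM as listed in Table 5.
BILSTM_GRID = {
    "learning_rate": [0.001, 0.01, 0.1],
    "units": [16, 32, 64, 128],
    "optimizer": ["Adam", "RMSprop", "SGD"],
    "batch_size": [5, 10, 16, 32],
    "epochs": [10, 20, 30, 40],
    "dropout": [0.10, 0.16, 0.18, 0.2],
}

# Settings that Table 5 reports as selected for Bi-LSTM.
REPORTED_BEST = {
    "learning_rate": 0.01,
    "units": 64,
    "optimizer": "Adam",
    "batch_size": 16,
    "epochs": 30,
    "dropout": 0.18,
}


def iter_grid(grid):
    """Yield every combination of hyperparameter values as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def grid_search(grid, score_fn):
    """Score every configuration exhaustively and return the best one."""
    return max(iter_grid(grid), key=score_fn)


def toy_score(config):
    """HYPOTHETICAL scorer: counts matches with REPORTED_BEST (no training)."""
    return sum(config[k] == v for k, v in REPORTED_BEST.items())


n_configs = sum(1 for _ in iter_grid(BILSTM_GRID))
best = grid_search(BILSTM_GRID, toy_score)
print(n_configs, best)
```

The Bi-LSTM grid alone contains 3 × 4 × 3 × 4 × 4 × 4 = 2304 configurations, which gives a sense of why the deep learning models in Table 4 are costly to tune as well as to train.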

The experimental study compared the performance of the proposed Stacked Classifier against the Voting Classifier and several other machine learning models, and against deep learning methods, for detecting suicidal ideation in social media posts. Several feature extraction techniques were employed for comparison: the pipeline of Count Vectorizer, N-gram, and TF-IDF, and the pretrained models BERT, MUSE, and LLM (GPT). The proposed BERT-integrated Stacking Classifier achieved accuracies of 96% on the Twitter dataset and 97% on the Reddit dataset. MUSE and LLM (GPT) embeddings also performed well but fell one to two percentage points short of BERT. Overall, the study highlights the effectiveness of the Stacking Classifier, particularly when BERT is used for feature extraction.

4.3 Cross validation

Cross-validation is a reliable technique for assessing model performance and is particularly useful when the available data are limited. The data are divided into k folds; the model is trained on k-1 folds and tested on the remaining fold, and the procedure is repeated so that each fold serves once as the test set. The cross-validation score is then averaged over the folds, which gives a more thorough assessment than a single train-test split. Table 6 shows the results for the Stacking Classifier. The 5-fold cross-validation means of 0.9619 on the Twitter dataset and 0.9734 on the Reddit dataset are consistent with the accuracies reported for the Stacking Classifier in Table 3, the best in the study.

Table 6
K-Fold Cross Validation of Stacking Classifier

Model	Cross-validation	Twitter dataset	Reddit dataset
Stacking Classifier (DT + RF + GB; meta-classifier = XGBoost)	Fold 1	0.962319	0.97231
	Fold 2	0.961221	0.974231
	Fold 3	0.962394	0.974781
	Fold 4	0.961435	0.97321
	Fold 5	0.96232	0.97231
	Mean	0.961938	0.973368
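The fold construction and the mean row of Table 6 can be reproduced with a few lines of Python. The splitter below is a minimal sketch that uses contiguous, unshuffled folds; whether the study shuffled or stratified its folds is not stated, so that choice is an assumption. The fold scores are copied from Table 6.

```python
from statistics import mean


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Each of the 10 toy samples is used for testing exactly once.
folds = list(k_fold_indices(10, k=5))

# Per-fold scores of the Stacking Classifier, copied from Table 6.
TWITTER_FOLDS = [0.962319, 0.961221, 0.962394, 0.961435, 0.96232]
REDDIT_FOLDS = [0.97231, 0.974231, 0.974781, 0.97321, 0.97231]

twitter_mean = round(mean(TWITTER_FOLDS), 6)
reddit_mean = round(mean(REDDIT_FOLDS), 6)
print(twitter_mean, reddit_mean)
```

The recomputed means match the last row of Table 6, and the spread between the best and worst fold is below 0.003 on both datasets.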

4.4 Complexity of the proposed model

The time complexity of the Stacking Classifier, whose base classifiers are Decision Tree, Random Forest, and Gradient Boosting and whose meta-classifier is XGBoost, depends on the number of training instances (n), the number of features (d), the number of base classifiers (m), and the time needed to train each base classifier (T_base), train the meta-classifier (T_meta), make predictions with a base classifier (T_predbase), and make predictions with the meta-classifier (T_predmeta). The overall training time of the Stacking Classifier can be approximated as (m × T_base) + T_meta, and the prediction time of the trained Stacking Classifier as (m × T_predbase) + T_predmeta.
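Written out with one cost term per base classifier, the approximation above takes the following form; it reduces to the stated expressions when all m base classifiers have the same cost. The second line is an added assumption, not a statement from the study: if the meta-features are generated with K internal folds, as common stacking implementations do, each base classifier is fitted roughly K + 1 times.

```latex
\begin{align}
T_{\text{train}}   &\approx \sum_{i=1}^{m} T_{\text{base}}^{(i)} + T_{\text{meta}},
&
T_{\text{predict}} &\approx \sum_{i=1}^{m} T_{\text{predbase}}^{(i)} + T_{\text{predmeta}} \\
T_{\text{train}}^{(K\text{-fold})} &\approx (K + 1) \sum_{i=1}^{m} T_{\text{base}}^{(i)} + T_{\text{meta}}
\end{align}
```

Here $T_{\text{base}}^{(i)}$ and $T_{\text{predbase}}^{(i)}$ denote the training and prediction cost of the i-th base classifier (Decision Tree, Random Forest, or Gradient Boosting, so m = 3), each of which depends on n and d.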

4.5 Comparison with related work

Direct comparison with other research findings is complicated by factors such as hardware differences and external noise. To provide a benchmark nonetheless, this study adopts datasets and methodology comparable to those of earlier work. Bhavini et al. [37] used the same Twitter dataset [40] with word embeddings as input to a Stacked CNN-LSTM model, achieving 93.92% accuracy. Akshma et al. used the same Reddit dataset, consisting of 20,000 posts, with GloVe embeddings as input to an Attention over CNN and LSTM model, reaching an accuracy of 88.48% [38]. In 2023, Ghosal et al. introduced a framework that distinguishes between depression and suicidal risk content using FastText embeddings, TF-IDF vectorization, and the XGBoost classifier; it achieved a 0.78 AUC and a 0.71 weighted F-score on a Reddit dataset, surpassing baseline models [33]. In 2019, Tadesse et al. used word embeddings as input to an LSTM-CNN model and achieved 93% accuracy on the Reddit dataset [19]. In 2021, Ning et al. used word and document embeddings in a deep learning approach with C attention, obtaining an accuracy of 84.3% for subtask 2 on the Twitter dataset [14]. Against this related work, the proposed Stacked Classifier, which combines Decision Tree, Random Forest, and Gradient Boosting as base classifiers with XGBoost as the meta-classifier and uses BERT embeddings, achieved 96% accuracy on the Twitter dataset and 97% on the Reddit dataset, as depicted in Fig. 6. This study can be seen as an extension of the related work, building primarily on [37] and [38].

Fig. 6

Comparison of related work.

4.6 Discussion and limitations

Suicidal ideation, a critical global health concern, demands early detection and intervention. Key indicators such as persistent depression and behavioral changes help identify at-risk individuals, and researchers have used techniques such as feature engineering and sentiment analysis to make progress in this area. However, prior studies on identifying suicidal posts face limitations such as reliance on a single dataset, biased data, and weak feature extraction. There is also a need for a model that balances simplicity and strength, reducing computational complexity while handling noisy data effectively. This study aims to address these challenges to advance the understanding and prevention of suicidal ideation. A significant finding was the model's ability to generalize across social media platforms. Two distinct datasets were used, from Reddit and Twitter, each with its own writing styles and linguistic traits. Despite these differences, the model performed consistently well on both. The Reddit dataset contained 15,164 posts related to suicide and 15,248 non-suicidal posts, while the Twitter dataset included 3,913 tweets indicating suicidal ideation and 4,102 non-suicidal tweets for training. This balanced class distribution supports fair training and reduces the risk of biased predictions.

Previous studies commonly utilized embedding techniques like count vectorizer, TFIDF, and word vectorizer, each with limitations. Count vectorizer overlooks semantic meaning, TFIDF disregards word order and syntax, and word embedding methods like word2vec or GloVe face challenges with fixed-size representations and out-of-vocabulary words. This study emphasizes the use of the pre-trained model BERT to overcome these limitations. BERT’s bidirectional contextual analysis and subword tokenization effectively handle out-of-vocabulary words and capture nuanced word meanings and semantic relationships. Sections 4.1 and 4.2 compare BERT with embedding techniques like MUSE and LLM, demonstrating BERT’s superior performance in feature extraction.

Detecting suicidal ideation in text faces challenges such as sparse datasets and significant noise. Previous research often relies on deep learning, which leads to lengthy computation and high complexity and may overfit when data are limited. The ensemble method proposed here aims to address these issues with a model that is both interpretable and resilient, streamlining computation and improving robustness to noisy data. In Section 4.1, the study compares voting and stacking classifiers and shows that stacking, with Decision Tree, Random Forest, and Gradient Boost as base classifiers and XGBoost as the meta-classifier, outperforms the voting classifier for every feature representation tested. Stacking trains a meta-learner to combine the base predictions, whereas voting aggregates them with a predetermined rule such as majority voting; this learned combination is what distinguishes stacking from the fixed rule of voting. In addition, stacking trained faster than voting in every configuration of Table 3 and on the Reddit dataset in Table 2, although it was slower on the Twitter dataset in Table 2 (677 s versus 573 s).

The literature often evaluates overfitting through a traditional train-test split but acknowledges limitations such as data scarcity, variability from random set selection, biases from imbalanced datasets, and difficulty in addressing data drift. To overcome these limitations, this research employs K-fold cross-validation, dividing the dataset into k subsets and iterating training and testing over them. This approach makes fuller use of the data, helps reveal overfitting or underfitting, and checks that performance is consistent across folds. Section 4.3 reports mean cross-validation scores of 96.2% on the Twitter dataset and 97.3% on the Reddit dataset, with little variation between folds, which indicates that the model is not overfitting and is robust across both datasets.

The current study emphasizes minimizing classification errors, building on prior research. Bhavini et al. (2023) and Akshma et al. (2022) achieved accuracies of 93.92% on the Twitter dataset and 88.48% on the Reddit dataset, respectively. Ghosal et al. (2023) introduced a framework that surpassed baseline models with a 0.78 AUC on a Reddit dataset. Tadesse et al. and Ning et al. reached accuracies of 93% and 84% on Reddit and Twitter datasets, respectively. Using BERT feature extraction, this study achieved accuracies of 96% on the Twitter dataset and 97% on the Reddit dataset, as detailed in Section 4.5.

The study’s limitation lies in its exclusive focus on text, overlooking behavioral cues from other formats like images, videos, audio clips, or emojis. Non-verbal cues and visual or auditory elements often convey emotions or behavioral changes not captured through text alone. User engagement frequency, posting patterns, and social interactions could provide valuable insights into behavior shifts and emotional well-being. Sole reliance on text analysis restricts findings, potentially missing subtle nuances in human interaction. Future research should integrate diverse data types and external contextual information to better understand behavior changes, disease markers, and mental health indicators on social media platforms.

5 Conclusion

The proposed model’s goal is to reduce the suicide rate in society by accurately detecting signs of suicidal intent in social media conversations. By analyzing behaviour changes and disease symptoms through social media data, this model aims to identify individuals at risk and provide timely medical care or assistance, which is crucial for effective suicide prevention. Unfortunately, ordinary life events and hardships can now have devastating consequences for individuals, their families, and their friends. Therefore, there is an urgent need for a reliable model that can effectively recognize and address posts related to suicide on social media. The proposed model, called the Stacked Classifier, combines several algorithms (Decision Tree, Random Forest, Gradient Boost) as base classifiers and employs XGBoost as the meta classifier to extract relevant information, achieving impressive accuracy rates of 97% for the Reddit dataset and 96% for the Twitter dataset.

The study’s limitation lies in its exclusive focus on text-based analysis, overlooking valuable cues from alternative formats such as images, videos, audio clips, or emojis. This hampers the understanding of non-verbal and contextual information crucial for discerning behavior and mental health indicators on social media. To address this, future research should integrate diverse data types and employ feature selection techniques to classify individuals with suicidal tendencies better. Investigating correlations between suicidal behavior and factors like family environment is essential, alongside leveraging advanced methodologies like FastText word embedding and diverse machine learning models for improved classification across datasets.

References

1. Katchapakirin, Wongpatikaseree, Yomaboot and Kaewpitakkun, Facebook social media for depression detection in the Thai community. In 2018 15th International Joint Conference on Computer Science and Software Engineering (JCSSE) (2018), pp. 1–6. IEEE.
2. Gao, Cheng and Yu P.L., Detecting comments showing risk for suicide in YouTube. In Proceedings of the Future Technologies Conference (2018, November), pp. 385–400. Springer, Cham.
3. Valeriano, Condori-Larico and Sulla-Torres, Detection of suicidal intent in Spanish language social networks using machine learning, International Journal of Advanced Computer Science and Applications 11(4) (2020).
4. Aladağ A.E., Muderrisoglu, Akbas N.B., Zahmacioglu and Bingol H.O., Detecting suicidal ideation on forums: proof-of-concept study, Journal of Medical Internet Research 20(6) (2018), e9840.
5. Coppersmith, Leary, Crutchley and Fine, Natural language processing of social media as screening for suicide risk, Biomedical Informatics Insights 10 (2018), 1178222618792860.
6. Sawhney, Manchanda, Mathur, Shah and Singh, Exploring and learning suicidal ideation connotations on social media with deep learning. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (2018), pp. 167–175.
7. Zhao, Lin and Huang, Text Classification of Micro-blog’s “Tree Hole” Based on Convolutional Neural Network. In Proceedings of the 2018 International Conference on Algorithms, Computing and Artificial Intelligence (2018), pp. 1–5.
8. Shing H.C., Nair, Zirikly, Friedenberg, Daume III and Resnik, Expert, crowdsourced, and machine assessment of suicide risk via online postings. In Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic (2018), pp. 25–36.
9. [...], Yu C.P., Fung S.F., Pan and Long, Supervised learning for suicidal ideation detection in online user content, Complexity 2018 (2018).
10. Hevia A.G., Menendez R.C. and Gayo-Avello, Analyzing the use of existing systems for the clpsych shared task. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology (2019), pp. 148–151.
11. Morales, Dey, Theisen, Belitz and Chernova, An investigation of deep learning systems for suicide risk assessment. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology (2019), pp. 177–181.
12. Matero, Idnani, Son, Giorgi, Vu, Zamani and Schwartz H.A., Suicide risk assessment with multi-level dual-context language and BERT. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology (2019), pp. 39–44.
13. Tadesse M.M., Lin, Xu and Yang, Detection of suicide ideation in social media forums using deep learning, Algorithms 13(1) (2019), 7.
14. Wang, Luo, Shivtare, Badal V.D., Subbalakshmi K.P., Chandramouli and Lee, Learning models for suicide prediction from social media posts. arXiv preprint arXiv:2105.03315 (2021).
15. Sawhney, Joshi, Shah and Flek, Suicide ideation detection via social and temporal user representations using hyperbolic learning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2021), pp. 2176–2190.
16. [...], Zhou, An, Cheng and Hu, Deep hierarchical ensemble model for suicide detection on imbalanced social media data, Entropy 24(4) (2022), 442.
17. Liu, Shi and Jiang, Detecting suicidal ideation in social media: an ensemble method based on feature fusion, International Journal of Environmental Research and Public Health 19(13) (2022), 8197.
18. Zhang, Schoene A.M., Ji and Ananiadou, Natural language processing applied to mental illness detection: a narrative review, NPJ Digital Medicine 5(1) (2022), 1–13.
19. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997).
20. Renjith, Abraham, Jyothi S.B., Chandran and Thomson, An ensemble deep learning technique for detecting suicidal ideation from posts in social media platforms, Journal of King Saud University-Computer and Information Sciences (2021).
21. Haque, Nur R.U., Al Jahan, Mahmud and Shah F.M., A transformer based approach to detect suicidal ideation using pre-trained language models. In 2020 23rd International Conference on Computer and Information Technology (ICCIT) (2020), pp. 1–5. IEEE.
22. Abdulsalam and Alhothali, Suicidal Ideation Detection on Social Media: A Review of Machine Learning Methods. arXiv preprint arXiv:2201.10515 (2022).
23. Huang, Li, Liu, Chiu, Zhu and Zhang, Topic model for identifying suicidal ideation in Chinese microblog. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation (2015), pp. 553–562.
24. Cao, Zhang, Feng, Wei, Wang, Li and He, Latent suicide risk detection on microblog via suicide-oriented word embeddings and layered attention. arXiv preprint arXiv:1910.12038 (2019).
25. [...], Zhang, Luo, Jia, Wei, Tao and Xu, Extracting psychiatric stressors for suicide from social media using deep learning, BMC Medical Informatics and Decision Making 18(2) (2018), 77–87.
26. Skaik and Inkpen, Using social media for mental health surveillance: a review, ACM Computing Surveys (CSUR) 53(6) (2020), 1–31.
27. Just M.A., Pan, Cherkassky V.L., McMakin D.L., Cha, Nock M.K. and Brent, Machine learning of neural representations of suicide and emotion concepts identifies suicidal youth, Nature Human Behaviour 1(12) (2017), 911–919.
28. Lotito and Cook, A review of suicide risk assessment instruments and approaches, Mental Health Clinician 5(5) (2015), 216–223.
29. Tan, Liu, Cheng and Zhu, Designing microblog direct messages to engage social media users with suicide ideation: interview and survey study on Weibo, Journal of Medical Internet Research 19(12) (2017), e8729.
30. Varathan K.D. and Talib, Suicide detection system based on Twitter. In 2014 Science and Information Conference (2014), pp. 785–788. IEEE.
31. Gunn J.F. and Lester, Twitter postings and suicide: An analysis of the postings of a fatal suicide in the 24 hours prior to death, Suicidologi 17(3) (2012).
32. Colombo G.B., Burnap, Hodorog and Scourfield, Analyzing the connectivity and communication of suicidal users on twitter, Computer Communications 73 (2016), pp. 291–300.
33. Ghosal and Jain, Depression and Suicide Risk Detection on Social Media using fastText Embedding and XGBoost Classifier, Procedia Computer Science 218 (2023), pp. 1631–1639, ISSN 1877–0509, https://doi.org/10.1016/j.procs.2023.01.141.
34. Braithwaite S.R., Giraud-Carrier, West, Barnes M.D. and Hanson C.L., Validating machine learning algorithms for Twitter data against established measures of suicidality, JMIR Mental Health 3(2) (2016), e4822.
35. Walsh C.G., Ribeiro J.D. and Franklin J.C., Predicting risk of suicide attempts over time through machine learning, Clinical Psychological Science 5(3) (2017), 457–469.
36. Bhat H.S. and Goldman-Mellor S.J., Predicting adolescent suicide attempts with neural networks. arXiv preprint arXiv:1711.10057 (2017).
37. Priyamvada, Singhal, Nayyar, et al., Stacked CNN - LSTM approach for prediction of suicidal ideation on social media, Multimed Tools Appl (2023).
38. Chadha and Kaushik, A Hybrid Deep Learning Model Using Grid Search and Cross-Validation for Effective Classification and Prediction of Suicidal Ideation from Social Network Data, New Gener. Comput. 40 (2022), 889–914.
39. Kaggle: https://www.kaggle.com/datasets/nikhileswarkomati/suicidewatch
40. Github: https://github.com/laxmimerit/twitter-suicidalintention-dataset
41. Del Gobbo, Guarino, Cafarelli, et al., GradeAid: a framework for automatic short answers grading in educational contexts: design, implementation and evaluation, Knowl Inf Syst 65 (2023), 4295–4433. https://doi.org/10.1007/s10115-023-01892-9
42. Guarino, Malandrino and Zaccagnino, An automatic mechanism to provide privacy awareness and control over unwittingly dissemination of online private information, Computer Networks 202 (2022), 108614, ISSN 1389-1286. https://doi.org/10.1016/j.comnet.2021.108614.
43. Radford, Wu, Child, Luan, Amodei and Sutskever, Language Models are Unsupervised Multitask Learners (2019).
44. Devlin, Chang, Lee and Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, North American Chapter of the Association for Computational Linguistics (2019).
45. Yang Y.C., Ahmad, Amin Guo, Law, Constant, Abrego, Yuan, Tar, Sung Y-H., Strope and Kurzweil, Multilingual Universal Sentence Encoder for Semantic Retrieval (2019).