BehavGLM: A graph-enhanced language model for unsupervised student campus behavior clustering

Abstract

The increasing availability of fine-grained student behavior data on smart campuses offers significant opportunities for personalized education. However, traditional clustering methods applied to such structured data often fail to capture the semantic complexity of behavioral features and the relational dependencies among individuals. To address these dual challenges, we propose a deep unsupervised clustering framework that integrates Bidirectional Encoder Representations from Transformers (BERT) and Graph Attention Networks (GAT). Recognizing that raw numerical features lack contextual depth, our approach first transforms structured data into natural language profiles, leveraging a pretrained BERT model to extract semantically rich embeddings. These individual representations are situated within a student behavior graph, where a GAT module refines node features by capturing relational structures and inter-student similarities. The combined embeddings enhance the performance of multiple clustering algorithms in identifying distinct behavioral patterns across students. In addition, we introduce a hierarchical anomaly detection module that identifies both unstable behavior clusters and outlier individuals based on intra-cluster variance and local density, providing a solution for detecting anomalous patterns in student populations. Experimental results on real-world campus datasets demonstrate the framework’s effectiveness, while further analysis highlights its practical utility in uncovering early indicators of academic risk through interpretable behavioral modeling.

Keywords

Student behavior modeling, unsupervised clustering, BERT, graph attention networks, educational data mining

1. Introduction

With the development of smart campus infrastructure, universities are now equipped with the ability to continuously collect fine-grained student behavior data across locations and time periods. Digital systems such as campus card platforms, camera recording, and learning management tools enable the logging of student consumption, mobility, and academic engagement. The student campus behavior data comprehensively reflect students’ on-campus routines, effort levels, and spatial-temporal behavior, forming a multi-dimensional representation of their campus life and learning status. They provide a crucial foundation for understanding student dynamics and open up new possibilities for precision management, risk early warning, and personalized education.^1–3

In the current field of education, student campus behavior data can support a wide range of analytical tasks, including correlation analysis, classification, and clustering. Correlation analysis approaches have quantified associations between behavioral indicators and academic performance, revealing significant behavioral dimensions predictive of student outcomes.^4,5 Classification models have been developed to identify engagement levels and behavior categories based on structured activity logs.^6–8 Clustering techniques have been widely employed to explore latent behavioral structures in student activity data, enabling the discovery of common engagement patterns and lifestyle archetypes.^9,10 These studies collectively highlight the analytical potential of behavior data for data-driven education management. Among these tasks, clustering, as a representative unsupervised learning technique, provides a flexible and scalable alternative that enables the discovery of latent behavioral structures and anomalies without requiring predefined labels. This makes it especially valuable for early-stage behavioral analytics and continuous monitoring in real-world campus environments, where labeled data are often scarce or incomplete.

Despite this progress, two major challenges remain underexplored in existing work. First, student behavior data collected from campus systems are inherently structured as tabular data, where each feature column carries semantic meaning that is closely tied to the column header and contextual value. However, existing studies typically treat these features as independent numeric inputs, overlooking the implicit semantic relationships between column names and data values. Second, there exist strong inter-student behavioral correlations that stem from shared routines, peer influence, or institutional schedules. These latent structural dependencies are rarely captured, as most models operate under the assumption of sample independence and fail to incorporate the topological behavior similarity among students.

The rapid advances in large language models (LLMs) offer a new paradigm for understanding tabular behavior data. By converting each row of behavior records into natural language descriptions, LLMs can be employed to generate contextual embeddings that encode both the semantic content and feature-label relationships, thereby addressing the semantic gap between tabular headers and their values. In parallel, the development of graph neural networks (GNNs) has provided effective tools for modeling relational data. Specifically, constructing a student similarity graph based on behavioral embeddings enables the learning of interdependent patterns across students, allowing the model to leverage group structures and peer similarities that are otherwise lost in traditional approaches. Moreover, while many prior works rely on supervised learning frameworks with labeled academic outcomes, such labels (e.g., GPA, dropout status) are often unavailable, delayed, or insufficient to reflect the nuanced and evolving nature of student behavior.

We propose BehavGLM, a novel clustering framework that integrates Bidirectional Encoder Representations from Transformers (BERT)¹¹ and Graph Attention Networks (GAT)¹² for student behavior modeling (Figure 1). BehavGLM first transforms structured behavioral features into standardized natural language descriptions and leverages a pretrained BERT model to extract semantic embeddings that encode temporal patterns, engagement intensity, and cross-feature interactions. A student graph is then constructed based on semantic similarity, and a GAT is applied to it to refine the embeddings by incorporating relational signals. This framework enables the joint modeling of individual behavioral semantics and the latent structural relationships within the student population.

Figure 1.

BehavGLM jointly models semantic and structural behavior information for unsupervised clustering and anomaly detection.

Our main contributions are as follows:

We introduce a semantic transformation module that maps structured tabular behavior data into natural language form, and encodes it using a pretrained BERT model to capture fine-grained behavioral semantics.

We construct a student similarity graph based on semantic proximity and apply GAT to model inter-student behavioral dependencies, thereby enhancing representation learning with relational context.

BehavGLM enhances clustering performance across multiple mainstream algorithms. In addition, the framework enables the detection of behavioral anomalies closely linked to academic risk, demonstrating its practical value in educational applications.

2. Related work

2.1. Analysis of student behavior

Campus-sensing data, such as smart card records, have been widely used to infer latent behavioral traits such as diligence, orderliness, and sleep regularity, which are strongly associated with academic performance.¹³ Beyond statistical associations, short-term sequences of campus activity offer fine-grained temporal features for capturing fluctuations in students’ cognitive and behavioral states, aiding in performance change prediction.¹⁴ Digital engagement also exerts measurable effects on learning outcomes. For example, social media overload has been shown to impair academic performance by inducing cognitive fatigue through stimulus–organism–behavior–consequence mechanisms.¹⁵ On the predictive side, behavioral features extracted from learning management systems (LMS), along with demographic and historical academic data, have been effectively used to build machine learning models for early detection of at-risk students.¹⁶ Recent hybrid deep learning advances have further improved behavioral profiling and anomaly detection in educational contexts. Among these, the TSA-GRU model integrates temporal sparse attention with GRU to mine sequential learning behaviors and identify anomalous student interactions,¹⁷ and the Learning Style Decoder framework combines psychological learning style metrics with deep neural networks to construct personalized, fine-grained learner profiles.¹⁸ In parallel, other studies have emphasized behavior similarity among peers and proposed multi-task learning frameworks to jointly model inter-semester and inter-major relations using smart card data, achieving robust performance in large-scale academic prediction tasks.¹³

Although these studies have provided useful empirical insights, many rely on manually engineered features and conventional statistical or shallow machine-learning pipelines. Such approaches can identify coarse correlations but are limited in their ability to exploit the full complexity of campus-sensing data.

2.2. Semantic representation learning for tabular data

Recent advances in modeling structured tabular data have shifted from static feature engineering toward semantic-aware representation learning. Transformer-based architectures, such as TabTransformer, have been introduced to model high-order dependencies between categorical and numerical features via self-attention, yielding more contextualized embeddings and improved classification performance.¹⁹ Complementing this, TabNet adopts a sequential attention mechanism combined with sparse feature selection to dynamically prioritize informative feature subsets, offering both strong predictive accuracy and interpretability in heterogeneous tabular domains.²⁰ Moreover, encoder-only transformer architectures have proven effective for user behavior profiling and complex sequential pattern mining across diverse application scenarios,²¹ with the inherent self-attention mechanism enabling robust extraction of hidden behavioral features even from structured tabular inputs. In addition to discriminative and attention-based schemes, generative and multimodal approaches have also emerged to further capture latent semantics and structural regularities. TabDiff leverages diffusion processes to model joint distributions over mixed-type features, effectively learning both continuous and discrete dimensions of tabular data in a unified generative framework.²² Meanwhile, LLM-based models like TableGPT2 integrate structured rows and columns with unstructured textual context, enabling generalized reasoning and semantic understanding through multimodal embeddings.²³ Extending this line of research, TableLLM enables direct manipulation of spreadsheet-like data using LLMs, highlighting the potential of semantic modeling in practical human-computer interaction scenarios.²⁴

These advances represent a shift toward semantically enriched and context-aware modeling, opening new opportunities to capture complex patterns in structured student behavior data. By leveraging these methods, it becomes possible to learn behavior-sensitive representations that effectively reflect both the semantic meaning and structural dependencies within student activity data.

2.3. Graph neural networks for relational behavior modeling

Graph Neural Networks (GNNs) have been increasingly employed to model complex relational structures inherent in behavioral data. To capture dynamic multi-relational interactions, recurrent GNN architectures utilize relation-specific message passing combined with recurrent updates, enabling temporal pattern modeling in heterogeneous graphs.²⁵ Building on the concept of heterogeneous edge types, multi-behavior GNN models represent diverse user actions as distinct relations within a unified graph structure, facilitating richer interaction modeling across behavior categories.²⁶ Complementing these approaches, methods for transforming structured tabular data into weighted graphs enable the preservation of both semantic features and relational proximity, thus providing a principled foundation for applying GNNs to traditionally structured datasets.²⁷ In line with this trend, hybrid temporal-graph models have also been developed for longitudinal student behavior profiling and anomaly detection, further bridging temporal behavioral dynamics and graph-based relational learning.²⁸ Together, these advances illustrate a cohesive methodological progression from temporal and spatial relational modeling toward integrated graph construction from structured behavioral records, underscoring the suitability of GNNs for capturing multifaceted student behavior patterns.

Graph-based methods have advanced the modeling of complex behaviors by effectively capturing spatial, temporal, and multi-relational dependencies. These methods are well-suited for modeling social groups like students.

3. Data preprocessing

The data used in our work was obtained from the campus card system logs provided by a university in Beijing, covering the campus behavior data of 12,834 undergraduate students during a specific semester. The original data was anonymized during the export process, including the removal of sensitive information such as student names and student ID numbers, to protect student privacy. The data types covered multidimensional daily behavior records, providing a foundation for subsequent modeling and analysis.

3.1. Dataset

The dataset used in this study primarily includes the following three aspects:

Campus Gate Access Data: records information about students entering and exiting the campus through campus gate turnstiles, including turnstile name, entry and exit direction, and timestamps for entering and exiting the campus gate. This type of data reflects students’ frequency of leaving campus, distribution of travel times, and patterns of time spent on campus.

Library Behavior Data: includes students’ swipe records when using the campus library, including entry and exit direction, channel type, and timestamps for entering and exiting the library. By analyzing these records, it is possible to characterize students’ learning behavior patterns.

Dining Behavior Data: covers students’ swipe records when dining at campus canteens, including dining locations, consumption amounts, consumption types, and consumption times. Such data can be used to uncover students’ dietary habits, meal regularity, and daily rhythms.

In summary, the dataset is relatively comprehensive in terms of dimensions, covering key aspects of students’ daily life and learning behaviors during their time on campus. This facilitates in-depth analysis of individual behavioral patterns and potential anomalies, laying a solid foundation for subsequent clustering modeling and anomaly detection.

3.2. Feature engineering

3.2.1. Feature extraction

After basic data cleaning, which included the removal of duplicate records, imputation of meaningful missing values, and correction of formatting inconsistencies, feature engineering was conducted to convert raw, low-level behavioral traces into structured and interpretable high-level representations suitable for downstream analysis. As campus behavior data are inherently sparse and unstructured, statistical aggregation and behavior-specific modeling techniques were applied. These methods enabled the extraction of meaningful features from core behavioral domains, as detailed in Tables 1–3.

Table 1.
Campus gate access features.

Feature name	Feature description
ZJ_JC	Turnstile entry-exit count
ZJ_maxt	Maximum entry duration

Table 2.

Library behavior features.

Feature name	Feature description
TSG_JC	Library access count
TSG_JRavg	Average daily entry count
TSG_JRmtr	Entry time entropy
TSG_JRstd	Entry time standard deviation
TSG_max	Maximum stay duration
TSG_CRavg	Average daily exit count
TSG_CRmtr	Exit time entropy
TSG_CRstd	Exit time standard deviation

Table 3.

Dining behavior features across breakfast, lunch, and dinner.

Feature prefix	Name	Feature description
bf_/ln_/dn_	entro	Meal time entropy
bf_/ln_/dn_	avg	Mean meal time
bf_/ln_/dn_	mode	Mode meal time
bf_/ln_/dn_	rag	Meal time range
bf_/ln_/dn_	perc	First quartile of meal time
bf_/ln_/dn_	max	Maximum meal time
bf_/ln_/dn_	med	Median meal time
bf_/ln_/dn_	std	Meal time standard deviation

Note: Prefixes bf_, ln_, and dn_ indicate breakfast, lunch, and dinner.
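The aggregation behind the Table 3 features can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that meal times are given as minutes after midnight, that the entropy is a Shannon entropy over 60-minute bins, and that the standard deviation is the population form; none of these choices is stated in the text. The function names `time_entropy` and `meal_features` are hypothetical.

```python
import math
import statistics
from collections import Counter


def time_entropy(minutes, bin_size=60):
    """Shannon entropy (bits) of event times bucketed into fixed-width bins.

    The 60-minute bin width is an illustrative assumption.
    """
    if not minutes:
        return 0.0
    counts = Counter(int(m // bin_size) for m in minutes)
    total = len(minutes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def meal_features(minutes, prefix):
    """Aggregate one student's meal times into Table 3 style features.

    `minutes` holds swipe times as minutes after midnight (assumed encoding);
    `prefix` is one of "bf_", "ln_", "dn_".
    """
    if not minutes:
        raise ValueError("at least one meal record is required")
    ordered = sorted(minutes)
    # statistics.quantiles needs two or more points; fall back to the single value.
    q1 = statistics.quantiles(ordered, n=4)[0] if len(ordered) > 1 else ordered[0]
    return {
        f"{prefix}entro": time_entropy(ordered),
        f"{prefix}avg": statistics.fmean(ordered),
        f"{prefix}mode": statistics.mode(ordered),
        f"{prefix}rag": ordered[-1] - ordered[0],
        f"{prefix}perc": q1,
        f"{prefix}max": ordered[-1],
        f"{prefix}med": statistics.median(ordered),
        f"{prefix}std": statistics.pstdev(ordered),
    }
```

Under these assumptions, a student whose breakfasts fall evenly into two hourly bins has an entropy of one bit, while a perfectly regular student has an entropy of zero.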

3.2.2. Semantic transformation and feature fusion

Although the extracted behavioral features are numerical and structured, many of them implicitly reflect contextual information that is better expressed in natural language. To leverage the powerful semantic understanding of large language models, we designed a transformation process that converts structured numerical records into standardized textual profiles for each student. This textual transformation serves as a bridge between conventional feature engineering and deep semantic modeling, enabling the integration of statistical patterns and contextual knowledge. Each student is associated with a narrative-style behavior profile that summarizes key traits across three dimensions: campus consumption, library usage, and campus gate activity. These profiles are constructed using rule-based sentence templates that map each numerical feature into a semantically informative and grammatically consistent statement.

The design of text templates follows a selection process informed by exploratory experiments. Given the vast design space for natural language prompting, we evaluated several template styles, including simplified numerical listings (feature-value pairs), narrative behavioral descriptions, and temporally focused patterns. Our preliminary evaluations indicated that the narrative-style template achieved superior performance in capturing semantic relationships between behavioral features. A plausible explanation is that narrative descriptions align better with the distribution of text data on which BERT was pre-trained, thereby better leveraging the model’s contextual understanding capabilities. An example template is as follows: “The student has … campus entries. The maximum time spent on campus is … minutes. The student entered the library on … days. The mean time of entering the library is around …”.
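A rule-based template of this kind can be sketched as follows. The sentence wording follows the example template above. The dictionary keys (`gate_entries`, `max_minutes_on_campus`, `library_days`, `library_mean_entry_minute`) are hypothetical names, not the feature names of Tables 1–3, and rendering the mean entry time as HH:MM is an assumption.

```python
def build_profile(features):
    """Render one student's numeric features as a narrative-style text profile.

    The keys used here are hypothetical placeholders for illustration.
    """
    # Convert minutes after midnight into an HH:MM string (assumed rendering).
    hours, minutes = divmod(int(round(features["library_mean_entry_minute"])), 60)
    sentences = [
        f"The student has {features['gate_entries']} campus entries.",
        f"The maximum time spent on campus is {features['max_minutes_on_campus']} minutes.",
        f"The student entered the library on {features['library_days']} days.",
        f"The mean time of entering the library is around {hours:02d}:{minutes:02d}.",
    ]
    return " ".join(sentences)
```

Because every profile is produced by the same fixed sentences, profiles differ only in the inserted values, which keeps the text comparable across students.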

4. Method

Modeling complex student behavior from structured campus data poses significant challenges. Raw behavioral attributes, such as meal times, library visits, and gate access frequencies, encode intricate temporal and contextual patterns that are not easily captured by traditional statistical or distance-based clustering methods. Moreover, students often display latent group structures or behavioral correlations, where similar routines may arise from shared social or environmental influences. Conventional approaches typically overlook both the internal semantics of individual behavior and the relational dependencies among students, limiting their capacity for expressive representation and robust generalization.

BehavGLM is designed as a modular framework that addresses these limitations by combining semantic modeling with graph-based relational learning. The approach begins by transforming each student’s structured behavioral record into a natural language profile, which is then encoded using a pretrained BERT model to extract deep semantic representations. To capture structural relationships among students, a similarity graph is constructed in the embedding space, where nodes represent students and edges denote behavioral proximity. The node embeddings on this graph are further refined via a GAT, which dynamically aggregates contextual information from neighbors to enhance behavior-aware representations. This process enables BehavGLM to capture both individual-level semantic detail and population-level interaction patterns, providing a more comprehensive foundation for clustering and anomaly detection.

4.1. Model structure

BehavGLM consists of three main components: (1) a semantic embedding module that leverages pretrained BERT¹¹ to transform structured behavior data into textual and then vectorized form; (2) a graph-based relational modeling component, in which a student similarity graph is first constructed in the semantic space using nearest-neighbor relations and a GAT-based encoder¹² is then applied to learn context-aware embeddings that capture inter-student behavioral dependencies; and (3) a downstream stage that applies clustering and anomaly detection to the learned embeddings. The pipeline is illustrated in Figure 2. This framework aims to address the challenges of representing heterogeneous and high-dimensional behavioral features and capturing implicit similarities or interactions between students based on their daily patterns.

Figure 2.

Overview of our BehavGLM framework.

Let $X \in \mathbb{R}^{N \times d}$ denote the feature representation of $N$ students, where $d$ includes both semantic and statistical dimensions. The goal is to learn a representation $Z \in \mathbb{R}^{N \times d'}$, where each row $z_{i}$ encodes both the semantic information and graph structural signals associated with student $i$. These embeddings are then used for clustering to identify underlying behavior groups.

4.2. Semantic embedding using BERT

To obtain high-dimensional semantic representations of student behavior, we employ a pretrained BERT model to encode the textual profiles generated in Section 3.2. The goal is to extract contextualized embeddings that capture both intra-student behavior consistency and inter-feature semantic dependencies. We choose BERT as our text encoder due to its strong contextual representation capability and mature open-source implementation. Compared with classical text encoding methods (e.g., TF-IDF, Doc2Vec), BERT can better capture the semantic information of campus behavior data, as validated by our comparative experiments (Section 5.1).

Each textual profile is first tokenized using the standard BERT tokenizer, preserving word boundaries and subword semantics. The tokenized sequence is then passed through the pretrained BERT-base model. We extract the hidden states from the final transformer layer and compute the average across all token embeddings to obtain a fixed-length sentence-level representation for each student.

The semantic embedding for each student is computed as:

$$e_{i} = \frac{1}{T} \sum_{t = 1}^{T} h_{t}^{(i)} \tag{1}$$

where $h_{t}^{(i)} \in \mathbb{R}^{768}$ denotes the hidden representation of token $t$ for student $i$ and $T$ is the sequence length after tokenization. This produces a fixed-size embedding $e_{i} \in \mathbb{R}^{768}$ for each student.
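The pooling step of Eq. (1) can be illustrated in isolation with a minimal sketch. It assumes the final-layer token vectors have already been obtained from the encoder and are passed in as plain lists; the function name `mean_pool` is hypothetical. The text does not state whether special or padding tokens are excluded from the average, so the sketch simply averages every vector it receives.

```python
def mean_pool(hidden_states):
    """Average token vectors h_1..h_T into one embedding e_i, as in Eq. (1).

    `hidden_states` is a list of T equal-length vectors. In the paper's
    setting each would have 768 dimensions; toy sizes work the same way.
    """
    if not hidden_states:
        raise ValueError("need at least one token vector")
    T = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / T for d in range(dim)]
```

The output length equals the hidden size regardless of $T$, which is what makes profiles of different lengths comparable.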

These embeddings capture rich behavioral semantics from the textualized student profiles. On a local level, the representations encode the meaning of individual behavioral attributes, such as breakfast regularity and library visit times, using token-wise attention mechanisms. On a global level, they encode how different behaviors interact contextually within the profile, e.g., the alignment of evening library visits with late dining patterns. This level of semantic abstraction would be difficult to capture using raw numerical features alone.

Importantly, the textual profiles fed into BERT are derived from structured statistical data using a controlled natural language template. The textualized behavioral descriptions follow relatively regular syntactic and lexical patterns, which align well with the inductive biases learned during BERT’s pretraining. This structural consistency potentially facilitates better semantic encoding across diverse behavioral profiles.

To complement the textual representation, the original normalized numerical features are retained and later fused with the semantic embeddings during downstream modeling, resulting in a hybrid representation:

$$x_{i} = [e_{i} \,\|\, \hat{\eta}_{i}] \tag{2}$$

where $\hat{\eta}_{i} \in \mathbb{R}^{m}$ denotes the standardized numerical feature vector and $\|$ represents vector concatenation. This enriched representation $x_{i} \in \mathbb{R}^{768 + m}$ combines contextual semantics and raw statistical descriptors in a unified format.
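The fusion in Eq. (2) can be sketched as a standardization step followed by concatenation. The text says only that the numerical features are standardized; the column-wise z-score with population standard deviation used here is an assumption, and the function names `standardize_columns` and `fuse` are hypothetical.

```python
import statistics


def standardize_columns(rows):
    """Z-score each feature column across students (assumed standardization).

    Columns with zero variance are mapped to 0.0 to avoid division by zero.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [
        [(v - mu) / sd if sd > 0 else 0.0 for v, mu, sd in zip(row, means, stds)]
        for row in rows
    ]


def fuse(semantic, numeric_std):
    """Concatenate each e_i with its standardized numeric vector, as in Eq. (2)."""
    return [list(e) + list(n) for e, n in zip(semantic, numeric_std)]
```

With 768-dimensional semantic embeddings and $m$ numerical features, each fused row has $768 + m$ entries.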

Compared to conventional tabular modeling approaches, which typically rely on decision trees or shallow neural networks,^29,30 the hybrid embedding leverages both global statistical patterns and token-level semantic cues. This duality makes it particularly effective in capturing nuanced behavior such as long-term regularity, cross-feature consistency, and outlier patterns that may not be evident from numerical values alone.

The fused embeddings are subsequently used as node features in the graph neural network architecture described in Section 4.3, where relational patterns between students can be further explored. By grounding each node in both deep semantic and quantitative behavioral space, the model gains a more holistic understanding of student behavior, facilitating more accurate clustering and anomaly detection in later stages.

4.3. Relational modeling using GAT

While the hybrid feature vectors $x_{i}$ derived in Section 4.2 encode comprehensive information for each individual student, they do not explicitly capture relationships between students, such as shared dining habits, similar library usage patterns, or group-level behavior regularities. To model such relational dependencies, we construct a graph-based structure that allows students to exchange information through a message-passing mechanism. This is achieved by building a student graph and applying a Graph Attention Network (GAT) to learn context-aware node representations.

4.3.1. Graph construction

We first construct a directed graph $G = (V, E)$, where each node $v_{i} \in V$ represents a student, and each edge $(i, j) \in E$ denotes a directed behavioral similarity connection from student $i$ to student $j$. The edge connections are not given a priori but are constructed using a $k$-Nearest Neighbors (kNN)³¹ strategy in the semantic space of BERT embeddings $\{e_{i}\}_{i = 1}^{N}$. These embeddings capture the semantic content of behavioral descriptions and serve as a natural metric space to define student similarity.

We rely on BERT embeddings, rather than fused vectors that include numerical statistics, ensuring that connections reflect semantic homophily. Since BERT is trained to model linguistic regularities, it effectively captures common patterns such as routine activities or shared preferences expressed in text. By contrast, incorporating numerical features at this stage might introduce noise or misleading closeness based on surface-level statistical proximity, which does not necessarily imply meaningful behavioral similarity.

Concretely, for each student $i$, we compute the Euclidean distance between $e_{i}$ and all other embeddings $e_{j}$ and identify the top-$k$ closest embeddings. For each such neighbor $j$, we create a directed edge $(i, j)$, forming the edge set $E$. As a result, each node has $k$ outgoing edges pointing to its $k$ most semantically similar peers. The directionality of the graph reflects the asymmetry of similarity relationships: $j$ being among $i$’s neighbors does not imply the reverse.

This graph structure models a semantic neighborhood topology, where local connectivity reflects fine-grained behavioral alignment. The parameter $k$ controls the granularity of local interactions: a small $k$ captures tightly coupled students, while a larger $k$ introduces more global influence. In our experiments, we set $k = 15$, balancing local sensitivity and global connectivity.
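The directed kNN construction can be sketched with a brute-force search. This is a minimal illustration rather than the authors' implementation: it compares every pair of embeddings, which is quadratic in the number of students, and the function name `knn_edges` is hypothetical. A dataset of the size used here would more plausibly rely on an indexed neighbor search.

```python
import math


def knn_edges(embeddings, k):
    """Return directed edges (i, j) from each node i to its k nearest neighbors.

    Distances are Euclidean; a node is never its own neighbor.
    """
    edges = []
    for i, e_i in enumerate(embeddings):
        distances = [
            (math.dist(e_i, e_j), j)
            for j, e_j in enumerate(embeddings)
            if j != i
        ]
        distances.sort()  # nearest first; ties broken by node index
        edges.extend((i, j) for _, j in distances[:k])
    return edges
```

On three points placed at 0, 1, and 3 on a line with `k=1`, the point at 3 selects the point at 1 as its neighbor, but the point at 1 selects the point at 0, which shows the asymmetry described above.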

4.3.2. Graph training using GAT

Once the graph is constructed, we apply a Graph Attention Network to perform neural message passing and learn refined representations that incorporate peer influence. GAT is well-suited to this setting because it adaptively learns the importance of different neighbors, which is particularly useful in behavior modeling where not all similar students contribute equally.

We choose to use the fused vectors rather than just BERT or numerical features for the following reasons. Semantic embeddings provide rich contextual understanding but may lack precision in capturing statistical extremes or variations, which are vital for identifying outliers or habitual deviations. Conversely, statistical features alone lack the expressive capacity to model inter-feature dependencies and behavioral semantics. The fused input allows GAT to learn attention weights that consider both what students do and how their behavior is described, resulting in more informed and nuanced aggregation from neighboring nodes. The GAT then performs message passing over the semantically-consistent graph to fuse these heterogeneous features.

Moreover, because the graph structure is built purely from BERT embeddings, injecting statistical features into the training stage does not affect the structural alignment but complements it by enriching the learning signal. This separation, semantic-based graph construction and hybrid-based message passing, ensures that the learned embeddings benefit from meaningful neighborhood contexts while remaining sensitive to diverse aspects of student behavior.

Formally, given node features $X = \{x_{1}, \dots, x_{N}\}$ and edge index $E$, a GAT layer computes the updated representation $z_{i}$ by aggregating transformed features from its neighbors $j \in \mathcal{N}(i)$, weighted by learned attention coefficients:

$$z_{i} = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} W x_{j}\right) \tag{3}$$

where $W$ is a learnable linear transformation, and $\sigma$ is a non-linear activation (ReLU). The attention weight $\alpha_{ij}$ is computed by comparing the transformed features of node $i$ and its neighbor $j$:

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{T} [W x_{i} \,\|\, W x_{j}]\right)\right)}{\sum_{k \in \mathcal{N}(i)} \exp\left(\mathrm{LeakyReLU}\left(a^{T} [W x_{i} \,\|\, W x_{k}]\right)\right)} \tag{4}$$

where $a$ is a learnable weight vector and $\|$ denotes concatenation. This mechanism allows the network to weigh neighbors differently, adapting to heterogeneity in student behavior patterns. Compared with fixed-weight aggregators (e.g., GCN), GAT provides data-driven flexibility in determining which peers are most informative.
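A single-head forward pass following Eqs. (3) and (4) can be written out explicitly as a sketch. It is a plain-Python illustration with fixed weights, not a trainable layer. The LeakyReLU slope of 0.2 is an assumption, since the text does not give a value, and the function names are hypothetical. Every node is assumed to have at least one neighbor, which holds for the kNN graph described earlier.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def leaky_relu(v, slope=0.2):
    """LeakyReLU; the 0.2 slope is an assumed value."""
    return v if v > 0 else slope * v


def gat_layer(X, neighbors, W, a):
    """Single-head GAT forward pass; returns (Z, alpha).

    X: list of node feature vectors; neighbors: dict i -> list of j;
    W: weight matrix; a: attention vector of length 2 * output_dim.
    """
    H = [matvec(W, x) for x in X]  # W x_j for every node
    Z, alpha = [], {}
    for i in range(len(X)):
        nbrs = neighbors[i]
        # Attention logits LeakyReLU(a^T [W x_i || W x_j]) for each neighbor j.
        logits = [
            leaky_relu(sum(p * q for p, q in zip(a, H[i] + H[j]))) for j in nbrs
        ]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        coeffs = [w / total for w in weights]
        for j, c in zip(nbrs, coeffs):
            alpha[(i, j)] = c
        agg = [sum(c * H[j][d] for j, c in zip(nbrs, coeffs)) for d in range(len(H[i]))]
        Z.append([max(0.0, v) for v in agg])  # ReLU as the outer activation
    return Z, alpha
```

For each node, the coefficients over its neighbors sum to one, so the update is a convex combination of transformed neighbor features followed by the activation.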

To improve the model’s expressive power, we stack two GAT layers, each with an intermediate hidden dimension of 32. The second layer allows information from second-order neighbors (i.e., neighbors of neighbors) to influence node representations, capturing broader structural patterns. We apply ReLU activation between layers to introduce non-linearity and stabilize training.

4.3.3. Self-supervised training with reconstruction loss

As labels for student types or anomalies are not available during embedding learning, we adopt a self-supervised learning strategy. The core idea is to make the final embedding $z_{i}$ from GAT as informative and close to the original input $x_{i}$ as possible, while also benefiting from the structure of the graph. We use a simple reconstruction loss to enforce this:

$$\mathcal{L} = \frac{1}{N} \sum_{i = 1}^{N} \| z_{i} - x_{i} \|_{2}^{2} \tag{5}$$

This objective encourages the GAT to act as a self-supervised structure-aware representation learning module.
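The objective in Eq. (5) can be evaluated with a minimal sketch. Eq. (5) is only defined when $z_{i}$ and $x_{i}$ have the same dimension. The text does not describe how the GAT output is brought back to the input dimension, so the sketch assumes the two already match and raises an error otherwise. The function name `reconstruction_loss` is hypothetical.

```python
def reconstruction_loss(Z, X):
    """Mean squared L2 distance between outputs z_i and inputs x_i, as in Eq. (5).

    Assumes every z_i has the same dimension as its x_i.
    """
    if len(Z) != len(X) or not X:
        raise ValueError("Z and X must contain the same, non-zero number of nodes")
    total = 0.0
    for z, x in zip(Z, X):
        if len(z) != len(x):
            raise ValueError("z_i and x_i must share a dimension")
        total += sum((zv - xv) ** 2 for zv, xv in zip(z, x))
    return total / len(X)
```

The loss is zero only when every output reproduces its input exactly, so any neighbor-driven change to a node's representation is traded off against fidelity to its own features.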

Ultimately, the output embeddings $\{z_{i}\}_{i = 1}^{N}$ serve as the final context-enriched representations for downstream clustering and anomaly detection, benefiting from both local graph structure and feature-level semantic fusion.

5. Experiments

This section presents a comprehensive evaluation of the proposed BehavGLM framework on real-world student campus behavior data. The experiments are designed to assess the model’s ability to identify meaningful behavior clusters and support downstream analyses such as anomaly detection and subgroup interpretation.

To guide the evaluation, we focus on the following research questions:

Q1. Does the integration of semantic embeddings and graph-based relational modeling enhance the clustering of student behavior profiles?
Q2. Are the resulting clusters behaviorally meaningful, and do they support downstream tasks such as anomaly detection?

5.1. Clustering using BERT and GAT

In this section, we provide a detailed description of the experimental setup, including the selection of clustering algorithms, internal evaluation metrics, and comparative results among different models.

5.1.1. Clustering algorithms

To explore the latent structures in student behavior data, we apply five classical clustering algorithms, together with one deep clustering baseline, to the final embeddings produced by the BERT-GAT framework. These methods span multiple paradigms, offering diverse perspectives on cluster formation in high-dimensional semantic space.

K-Means, a classic partitioning algorithm, minimizes within-cluster variance by iteratively updating centroids based on Euclidean distance.³² Its simplicity and efficiency make it suitable for well-separated spherical clusters, but it is sensitive to initialization and often struggles with non-convex structures. To address the limitations of hard assignment and geometric rigidity, we evaluate Gaussian Mixture Models (GMM), which adopt a probabilistic framework to model data as a mixture of Gaussians.³³ GMMs provide soft assignments and are better suited for elliptical or overlapping clusters, although they are more sensitive to outliers and parameter estimation. We include BIRCH, which incrementally builds a clustering feature tree to summarize data and supports fast, hierarchical aggregation.³⁴ While BIRCH is robust to noise, its performance can degrade when clusters are poorly separated or global optimization is needed. Agglomerative hierarchical clustering is also considered for its interpretability and ability to capture nested structures via bottom-up merges based on Ward linkage.³⁵ Although deterministic, it is less scalable and sensitive to local merge decisions. Lastly, spectral clustering leverages the eigenstructure of a similarity matrix to perform dimensionality reduction before clustering in the spectral space.³⁶ It effectively detects non-convex and global patterns but requires careful graph construction and is computationally expensive for large datasets. Furthermore, we include Deep Embedded Clustering (DEC) as a deep learning baseline.³⁷ DEC simultaneously learns feature representations and cluster assignments by optimizing a Kullback–Leibler (KL) divergence-based objective, allowing for iterative refinement of cluster centers.

To ensure fair comparison, we avoid enforcing a unified number of clusters across all methods. For algorithms like K-Means and GMM, we examine multiple values of $k$ to evaluate robustness. For hierarchical methods such as BIRCH and Agglomerative Clustering, flat cluster partitions are extracted from the hierarchy to support standardized evaluation.

5.1.2. Evaluation metrics

To quantitatively assess the clustering quality, we employ four widely used internal evaluation metrics: Silhouette Score, Calinski-Harabasz Index (CH Index), Davies-Bouldin Index (DB Index), and Scattering and Density-Based Clustering Validity (S_Dbw).

The Silhouette Score measures how similar a sample is to its own cluster compared to other clusters. For a data point $i$ , let $a (i)$ be the average distance to all other points in the same cluster, and $b (i)$ the lowest average distance to points in any other cluster. The silhouette coefficient is defined as shown in equation (6). The overall silhouette score is the mean of $s (i)$ across all samples. It ranges from $- 1$ to $1$ , where higher values indicate better clustering.

s (i) = \frac{b (i) - a (i)}{\max \{a (i), b (i)\}}

(6)
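To make the definition of $s (i)$ concrete, the following plain-Python sketch evaluates the silhouette coefficient on a toy dataset with two well-separated clusters. The helper names and the data are hypothetical, and the code assumes at least two clusters; singleton clusters are given $s (i) = 0$, a common convention that the text does not discuss.

```python
import math

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def silhouette_samples(points, labels):
    """Return s(i) for every point, following equation (6)."""
    scores = []
    for i, p in enumerate(points):
        # Accumulate distances from point i to every cluster, excluding i itself.
        sums, counts = {}, {}
        for j, q in enumerate(points):
            if i == j:
                continue
            c = labels[j]
            sums[c] = sums.get(c, 0.0) + euclid(p, q)
            counts[c] = counts.get(c, 0) + 1
        own = labels[i]
        if own not in counts:  # singleton cluster
            scores.append(0.0)
            continue
        a = sums[own] / counts[own]                              # a(i)
        b = min(sums[c] / counts[c] for c in sums if c != own)   # b(i)
        scores.append((b - a) / max(a, b))
    return scores

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
s = silhouette_samples(points, labels)
score = sum(s) / len(s)  # overall silhouette score
```

For this layout every point has $a (i) = 1$ and $b (i) \approx 5.05$, so the overall score is close to 0.80.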

The CH Index evaluates cluster validity by the ratio of between-cluster dispersion to within-cluster dispersion. Given a dataset with $n$ samples and $k$ clusters, with within-cluster dispersion matrix $W$ and between-cluster dispersion matrix $B$ , the CH Index is defined as shown in equation (7), where $Tr (\cdot)$ denotes the matrix trace. A higher CH Index suggests better defined and well-separated clusters.

CH = \frac{Tr (B) / (k - 1)}{Tr (W) / (n - k)}

(7)

The DB Index measures the average similarity between each cluster and its most similar counterpart. It is defined as shown in equation (8), where $s_{i}$ is the average intra-cluster distance for cluster $i$ , and $d_{i j}$ is the distance between the centers of clusters $i$ and $j$ . Lower DB Index values indicate better clustering with high separation and compactness.

DB = \frac{1}{k} \sum_{i = 1}^{k} \max_{j \neq i} \frac{s_{i} + s_{j}}{d_{i j}}

(8)
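Equations (7) and (8) can be checked by hand on a small example. The sketch below is a plain-Python illustration with hypothetical helper names and toy data; it takes $Tr (B)$ as the size-weighted squared distance of cluster centers to the overall mean, $Tr (W)$ as the summed squared distance of points to their own center, and $s_{i}$ as the mean distance of points to their cluster center.

```python
import math

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def group(points, labels):
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    return clusters

def ch_index(points, labels):
    """Equation (7): (Tr(B) / (k - 1)) / (Tr(W) / (n - k))."""
    clusters = group(points, labels)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    tr_b = tr_w = 0.0
    for rows in clusters.values():
        centre = centroid(rows)
        tr_b += len(rows) * euclid(centre, overall) ** 2
        tr_w += sum(euclid(r, centre) ** 2 for r in rows)
    return (tr_b / (k - 1)) / (tr_w / (n - k))

def db_index(points, labels):
    """Equation (8): mean over clusters of the largest (s_i + s_j) / d_ij."""
    clusters = group(points, labels)
    centres = {c: centroid(rows) for c, rows in clusters.items()}
    spread = {
        c: sum(euclid(r, centres[c]) for r in rows) / len(rows)
        for c, rows in clusters.items()
    }
    worst = [
        max((spread[i] + spread[j]) / euclid(centres[i], centres[j])
            for j in clusters if j != i)
        for i in clusters
    ]
    return sum(worst) / len(worst)

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
ch = ch_index(points, labels)  # Tr(B) = 25, Tr(W) = 1, so CH = 50
db = db_index(points, labels)  # s = 0.5 per cluster, d = 5, so DB = 0.2
```

On this toy layout the two indices move in opposite directions as expected: tight, distant clusters give a large CH value and a small DB value.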

The S_Dbw Index is a density-aware internal metric free of convex cluster assumptions. It consists of two core components: intra-cluster scattering $Scat (C)$ that measures the dispersion degree within each cluster, and inter-cluster density separation $Dens (C)$ that quantifies the separation degree between different clusters, with the overall index defined as shown in equation (9). S_Dbw ranges from 0 to positive infinity, and lower values represent better clustering performance, indicating tighter intra-cluster cohesion and clearer inter-cluster separation.

S\_Dbw = Scat (C) + Dens (C)

(9)

5.1.3. Clustering results analysis

Table 4 presents the clustering outcomes under three different configurations: the baseline model without BERT or GAT, a semantically enhanced variant using BERT, and the full framework enhanced by both BERT and GAT. Across all six clustering algorithms, we observe consistent and substantial performance improvements as semantic and relational components are successively integrated, demonstrating the robustness and generalizability of the proposed representation pipeline. To examine the contribution of each module in greater depth, we take Agglomerative Clustering as a representative example and compare its performance across the three configurations. With the introduction of BERT embeddings, the Silhouette Coefficient improves marginally from 0.3474 to 0.3505, indicating slightly enhanced intra-cluster cohesion. More notably, the CH Index increases from 6074.78 to 6642.27, reflecting a more distinct separation between cluster centroids. Meanwhile, the DB Index drops from 1.2700 to 1.2532, and the S_Dbw Index decreases from 0.6024 to 0.5919, both suggesting reduced dispersion within clusters and improved inter-cluster density separation. These trends suggest that semantic transformation alone, by enriching feature context, can sharpen cluster boundaries and improve structural coherence.

Table 4.
Comparison of clustering methods with different model architectures.

BERT	GAT	Model	Silhouette score	CH index	DB index	S_Dbw
$\times$	$\times$	K-means	0.314	5916.9649	1.2711	0.5982
		GMM	0.2661	5214.3570	1.4970	0.6101
		Birch	0.2314	5339.3383	1.4099	0.6009
		Agglomerative	0.3474	6074.7759	1.2700	0.6024
		Spectral	0.3437	4129.7090	2.1467	0.6584
		DEC	0.3073	5809.1801	1.2228	0.6004
$✓$	$\times$	K-means	0.3457	6684.3928	1.2597	0.5896
		GMM	0.2664	5980.2806	1.3975	0.6043
		Birch	0.2335	5704.7742	1.3811	0.5919
		Agglomerative	0.3505	6642.2668	1.2532	0.5919
		Spectral	0.3703	5018.0196	1.0546	0.6313
		DEC	0.3082	6255.4021	1.2087	0.5930
$✓$	$✓$	K-means	0.4692	12951.0705	0.8170	0.4027
		GMM	0.4542	11595.3779	0.9659	0.4130
		Birch	0.3157	9113.4477	0.9068	0.4093
		Agglomerative	0.5546	20615.7716	0.7207	0.4018
		Spectral	0.4825	11643.7867	0.9258	0.4919
		DEC	0.4347	12785.2609	0.9835	0.5005

Note: The best performing values for each metric are highlighted in bold, with the second-best underlined.

Upon further incorporating GAT-based relational modeling, the performance gains become markedly more pronounced. In the Agglomerative setting, the Silhouette Coefficient rises from 0.3505 to 0.5546, a relative gain of about 58.2% over the BERT-only variant. The CH Index surges from 6642.27 to 20615.77, the DB Index drops from 1.2532 to 0.7207 (a relative reduction of about 42.5%), and the S_Dbw Index decreases from 0.5919 to 0.4018 (a relative reduction of about 32.1%). These improvements reflect not only greater intra-cluster compactness, but also enhanced inter-cluster separation. Similar trends are consistently observed across other clustering algorithms, reinforcing the broad effectiveness of the combined approach.

Among the four evaluation metrics, the CH Index shows the most substantial absolute increase, a pattern that deserves specific attention. This index, which simultaneously rewards between-cluster separation and penalizes within-cluster dispersion, is particularly sensitive to structural refinement in high-dimensional spaces. The sharp rise observed after integrating both BERT and GAT suggests that our framework not only reduces redundancy and noise within clusters but also structurally magnifies latent differences across behavior groups. Importantly, this improvement is not isolated to one method or setting, but rather generalizes across diverse clustering paradigms, highlighting the versatility of the proposed architecture.

Taken together, these results demonstrate a clear two-stage enhancement mechanism: the BERT module captures contextual semantics from structured behavioral features, translating them into more discriminative embeddings; the GAT module then injects structure-aware refinement by modeling inter-student similarity patterns within a graph context. This combination enables the discovery of more coherent, well-separated, and semantically meaningful clusters, as quantitatively supported by all four evaluation metrics.

To visually evaluate the quality of the learned latent space representation, we employed t-SNE to project the results of hierarchical clustering onto a two-dimensional plane. As shown in Figure 3, the model identifies clear cluster cores with high intra-cluster cohesion. While some visual overlap and boundary transitions are observed among the clusters, we attribute this phenomenon to two interrelated factors. First, the extreme dimensionality reduction from a high-dimensional feature space to two dimensions inevitably introduces information loss and visual distortion, which is an inherent limitation of t-SNE projection. Second, certain boundary samples genuinely represent students with transitional behavioral patterns, exhibiting mixed characteristics that span multiple groups, which reflects the inherent continuity and complexity of human behavior in a real campus environment. These two factors jointly contribute to the observed boundary phenomena in the visualization.

Figure 3.

t-SNE visualization of BehavGLM embeddings with hierarchical clustering.

5.1.4. Parameter sensitivity analysis

To evaluate the impact of the graph construction parameter $k$ on representation learning and clustering quality, we conduct a sensitivity analysis using hierarchical clustering as an example, varying $k$ from 5 to 25 (Table 5).

Table 5.
Clustering performance with different $k$ values in graph construction.


$k$	Silhouette score	CH index	DB index	S_Dbw
5	0.4889	15648.84	0.8334	0.4513
10	0.4903	16343.98	0.7808	0.4150
15	0.5546	20615.77	0.7207	0.4018
20	0.5492	19108.14	0.7299	0.4019
25	0.5393	17684.04	0.7324	0.4106

As illustrated in Table 5, the model achieves its peak performance at $k = 15$ , where all four evaluation metrics reach their optimal values. Notably, across the entire tested range of $k$ (from 5 to 25), the framework integrated with GAT consistently outperforms both the raw numerical clustering and the BERT-only clustering scenarios. This sustained superiority indicates that the graph-based relational modeling captures inter-student behavioral dependencies more robustly than traditional methods. The results indicate that the inclusion of topological constraints through GAT provides a more stable and discriminative latent space, keeping the framework effective across the tested neighborhood sizes.

5.1.5. Computational efficiency

Experiments were run on workstations equipped with NVIDIA A6000 GPUs. On whole-school datasets, the total runtime for the complete analysis remains under 20 min per execution, indicating that the integration of deep learning modules does not impede the model’s scalability for routine educational administrative tasks.

5.2. Comparison of semantic encoding baselines

To validate the effectiveness of BERT for encoding campus behavior data, we compare it with two classical text encoding methods: TF-IDF³⁸ and Doc2Vec.³⁹ For fair comparison, we replace the BERT encoder in our framework with these methods while keeping all other components unchanged.

Table 6 presents the clustering results using Agglomerative Clustering. BERT consistently outperforms both TF-IDF and Doc2Vec across all four metrics. These results demonstrate that BERT’s contextualized embeddings can better capture the semantic relationships in the textual behavior profiles, validating our choice of BERT as the encoding component in our model.

Table 6.
Clustering performance of different embedding methods.

Method	Silhouette score	CH index	DB index	S_Dbw
TF-IDF	0.2997	11664.56	1.1858	0.9808
Doc2Vec	0.4222	11222.53	0.9012	0.4946
BERT	0.5546	20615.7716	0.7207	0.4018

5.3. Analysis of clustered behavior patterns

To further interpret the clustering results, this section conducts an in-depth analysis of student behavioral patterns within the identified groups. Since the agglomerative clustering method achieved the best overall performance under the proposed BehavGLM, it is selected as the basis for subsequent interpretation.

Considering the high dimensionality of the original behavioral dataset, we first reduce the complexity of the data by selecting a set of representative features. This step helps avoid visual clutter and ensures the clarity of downstream pattern exploration. Based on the selected features, we then examine the behavioral differences among clusters through visualization, aiming to uncover distinct behavioral profiles and provide intuitive evidence for the model’s discriminative capacity.

5.3.1. Feature selection

To reduce dimensional complexity while preserving the essential behavioral variance in the data, we apply Principal Component Analysis (PCA) to the original numerical features.⁴⁰ As shown in Figure 4, the cumulative explained variance ratio demonstrates that the first nine principal components capture over 90% of the total variance, thus offering a compact yet informative representation of the data.

Figure 4.

PCA cumulative variance ratio curve.

To identify the most representative original variables contributing to these components, we adopt a weighted loading analysis. Specifically, for each feature $x_{j}$ , we calculate its absolute loading $| l_{i j} |$ on the $i$-th principal component and weight it by the variance explained by that component $v_{i}$ . The final weighted contribution score $S_{j}$ for each feature is computed as shown in equation (10), where $k = 9$ is the number of selected principal components. This method accounts for both the strength of each feature’s contribution to each component and the relative importance of the components themselves.

S_{j} = \sum_{i = 1}^{k} v_{i} \cdot | l_{i j} |

(10)

Using a threshold of 15.00% for the normalized scores, eight high-impact features are selected: dn_perc, dn_med, ln_perc, TSG_JC, dn_max, bf_perc, ln_mode, TSG_CRmtr. These variables will serve as the foundation for the subsequent visual analysis of behavioral clustering.
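The weighted contribution score $S_{j}$ defined above is a short computation once the PCA loadings and explained variance ratios are known. The sketch below uses hypothetical loadings for two components and three features; the normalization applied before thresholding is not specified in the text and is therefore left out.

```python
def weighted_contributions(loadings, explained_variance):
    """S_j = sum_i v_i * |l_ij| over the selected components.

    loadings[i][j] is the loading of feature j on component i;
    explained_variance[i] is the variance ratio v_i of component i.
    """
    n_features = len(loadings[0])
    return [
        sum(v * abs(row[j]) for v, row in zip(explained_variance, loadings))
        for j in range(n_features)
    ]

# Illustrative values only: 2 components, 3 features.
loadings = [[0.8, -0.6, 0.0],
            [0.0, 0.6, -0.8]]
explained_variance = [0.7, 0.2]

scores = weighted_contributions(loadings, explained_variance)
# Feature indices ordered from most to least influential.
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

Because loadings enter through their absolute values, a feature with a strong negative loading on a dominant component ranks as high as one with an equally strong positive loading.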

5.3.2. Behavioral pattern analysis

To further interpret the behavioral characteristics of different groups of students, we used a multidimensional visualization approach based on the eight key behavioral characteristics. As shown in Figure 5, we utilized an improved parallel coordinates plot, where multiple standardized behavioral dimensions are represented as parallel vertical axes. The average values of each group across these dimensions are connected via polylines, forming a visual trajectory that reflects the overall behavioral tendencies of each group.

Figure 5.

Visualization of clusters based on core features.

To capture variability within each cluster, the visualization incorporates semitransparent bands around each polyline to represent one standard deviation above and below the mean. This augmentation not only enables the identification of intercluster behavioral differences but also provides insight into intracluster consistency or heterogeneity across dimensions.

The figure illustrates the distributional patterns of five clusters (from cluster one to cluster five) across key features:

Cluster one: shows consistently low values across most dimensions, particularly for lunch-related features (ln_perc, ln_mode), suggesting an extremely low frequency or absence of lunch activities. Similarly, values for library usage (TSG_JC, TSG_CRmtr) are minimal, indicating infrequent and narrowly distributed access. The wide standard deviation bands across features imply high within-group variability and behavioral inconsistency.

Cluster two and Cluster four: exhibit similar profiles in dining behavior, with higher average values across all meal-related dimensions, reflecting consistent and frequent on-campus dining activities. However, their distinction lies in library behavior: Cluster four demonstrates much higher values in both TSG_JC and TSG_CRmtr, implying frequent and diverse library usage, while Cluster two shows near-zero values, indicating a near-complete absence of library engagement. Moreover, Cluster four displays broader deviation bands in most dimensions, suggesting greater behavioral diversity within the group compared to Cluster two.

Cluster three: is characterized by the lowest breakfast participation (bf_perc) and relatively high values for lunch and dinner features. It also demonstrates frequent and temporally diverse library usage, suggesting an active and flexible academic routine. However, Cluster three exhibits a larger standard deviation range in some feature dimensions.

Cluster five: presents values near zero in all dinner- and library-related characteristics, suggesting almost no involvement in on-campus evening meals or academic facility use. This group shows minimal participation in structured campus activities and notably low variability across characteristics, indicating high behavioral consistency, although in an anomalous or disconnected pattern.

In general, this visualization offers an interpretable, information-rich representation of both the central tendencies and the variability within each cluster, facilitating a comprehensive understanding of students’ behavior patterns across the selected key dimensions.

5.4. Detection of behavioral outliers

To further uncover students whose behavioral patterns diverge from peer norms, we perform unsupervised anomaly detection based on the cluster structure. Specifically, we first detect clusters that exhibit abnormal internal variability using a robust cluster-level detection method, and subsequently identify individual-level anomalies within those clusters through a local density-based approach. This two-stage process enables the discovery of both structurally unstable behavioral groups and behaviorally disconnected individuals.

5.4.1. ROCF-based detection of anomalous clusters

To identify structurally unstable groups, we adopt a modified ROCF approach that emphasizes intra-cluster behavioral variance rather than cluster size. High internal variance suggests behavioral heterogeneity and potential anomalies. The ROCF value combines the variance of each cluster $\mathrm{var}_{i}$ and its relative increase, the Temporal Leap (TL):

\mathrm{ROCF}_{i} = \frac{\mathrm{var}_{i}}{\mathrm{var}_{i - 1}} \times \mathrm{TL}_{i}, \quad \mathrm{TL}_{i} = \frac{\mathrm{var}_{i} - \mathrm{var}_{i - 1}}{\mathrm{var}_{i - 1}}

(11)

A higher ROCF indicates a sharp increase in behavioral dispersion, flagging the corresponding cluster as behaviorally anomalous.

The ROCF-based detection results are shown in Table 7, with clusters sorted by ascending variance. Cluster 5 emerges as a statistically anomalous group, exhibiting the largest ROCF score and the highest total intra-cluster variance among the scored clusters, which suggests internal behavioral volatility and potential structural deviation. Cluster 3 has the largest variance overall but is not assigned a TL or ROCF score due to the lack of a subsequent cluster in the sorted sequence. Importantly, prior analysis of Cluster 3 in Section 5.3 has shown relatively regular and positive behavioral tendencies, indicating that this group is behaviorally distinct but not structurally anomalous.

Table 7.

ROCF-based anomaly indicators for behavioral clusters.

Cluster	Volume (Var Sum)	TL	ROCF
1	11.374415	0.052624	1.054033
4	11.972984	0.023460	1.023738
2	12.253874	0.116257	1.123285
5	13.678476	0.117710	1.124918
3	15.288575	—	—
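The TL column of Table 7 can be recomputed from the listed variance sums. The sketch below assumes that clusters are ordered by ascending variance and that TL for a cluster is its relative increase towards the next cluster in that order. This forward-looking index convention is inferred from the tabulated values and from the missing entry for the last cluster; it is an assumption rather than a definition taken from the text.

```python
import math

# Variance sums copied from Table 7, in the table's ascending order.
var_sum = {1: 11.374415, 4: 11.972984, 2: 12.253874, 5: 13.678476, 3: 15.288575}

order = list(var_sum)  # dicts preserve insertion order
tl = {}
for cur, nxt in zip(order, order[1:]):
    # Assumed convention: relative increase towards the next cluster in the sequence.
    tl[cur] = (var_sum[nxt] - var_sum[cur]) / var_sum[cur]

# The last cluster in the sequence has no successor and therefore no TL value.
most_unstable = max(tl, key=tl.get)

# Numerical observation only (not a relation stated in the text).
exp_tl = {c: math.exp(v) for c, v in tl.items()}
```

Under this assumption Cluster 5 has the largest TL, matching the ranking in Table 7. As a purely numerical observation, the tabulated ROCF values also agree with `exp(TL)` to the printed precision.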

5.4.2. LOF-based detection of anomalous individuals

To further identify local anomalies within Cluster 5, we employ the Local Outlier Factor (LOF) algorithm,⁴¹ which detects deviations based on local density. For each student $x$ , the LOF score quantifies how its local reachability density compares with that of its $k$ -nearest neighbors:

\mathrm{LOF}_{k} (x) = \frac{1}{| N_{k} (x) |} \sum_{y \in N_{k} (x)} \frac{\mathrm{lrd}_{k} (y)}{\mathrm{lrd}_{k} (x)}

(12)

We apply LOF to Cluster 5 with $k = 20$ and a contamination rate of 0.005. This yields 5 students (0.5%) flagged as local behavioral outliers.
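To show how the LOF score in equation (12) behaves, the following plain-Python sketch computes it on hypothetical toy data. It is a simplified illustration rather than the pipeline used in the experiments: each neighborhood holds exactly $k$ points (ties broken by index), duplicate points are not handled, and the data, function names, and $k = 3$ are assumptions made for the example.

```python
import math

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def lof_scores(points, k):
    """LOF_k for every point, using exactly k nearest neighbours per point."""
    n = len(points)
    dist = [[euclid(p, q) for q in points] for p in points]
    neighbours, k_dist = [], []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(ranked[:k])
        k_dist.append(dist[i][ranked[k - 1]])  # distance to the k-th neighbour

    def lrd(i):
        # Local reachability density: inverse of the mean reachability distance.
        reach = [max(k_dist[j], dist[i][j]) for j in neighbours[i]]
        return len(reach) / sum(reach)

    density = [lrd(i) for i in range(n)]
    return [
        sum(density[j] for j in neighbours[i]) / (len(neighbours[i]) * density[i])
        for i in range(n)
    ]

# Toy data: a 3 x 3 grid of regular points plus one distant point.
points = [(float(x), float(y)) for x in range(3) for y in range(3)] + [(10.0, 10.0)]
scores = lof_scores(points, k=3)
flagged = max(range(len(points)), key=lambda i: scores[i])  # index of the top outlier
```

Points inside the grid obtain scores close to 1, while the distant point scores far higher. Where scikit-learn is available, `sklearn.neighbors.LocalOutlierFactor` with `n_neighbors=20` and `contamination=0.005` would be the more usual way to reproduce the setting described above.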

5.4.3. Outlier behavioral profiles

To visually interpret the detected outliers’ behavioral deviations, original features are grouped into four categories: (a) Turnstile and Library Access, (b) Breakfast, (c) Lunch, and (d) Dinner. Standardized values of each feature are plotted in the form of heatmaps, as shown in Figure 6. Red tones indicate significantly above-average behavior, while blue tones indicate suppression or absence of activity.

Figure 6.

Differential heatmaps of student behaviors across (a) campus gate and library, (b) breakfast, (c) lunch, and (d) dinner.

The heatmaps reveal distinct patterns of deviation among the five detected students, indicating that the identified outliers do not follow a single anomalous trajectory but instead fall into distinct subtypes of behavioral disconnection. Based on these differentiated patterns, the behavioral profiles of the outliers are further summarized as follows:

Outlier 1: displays consistently high values across nearly all dimensions. This student exhibits exceptionally frequent gate crossings, long campus stay durations, and intensive library use. Their meal-related behaviors are also highly active, suggesting a highly regular yet over-engaged behavioral pattern.

Outlier 2: also demonstrates above-average library-related activity, but exhibits almost no lunch or dinner records, indicating fragmented engagement or an unusual schedule.

Outliers 3–5: display a highly consistent pattern of abnormal behavior. Across all dimensions except dinner-related features, these students exhibit either zero values or complete data absence. In contrast, they show abnormally high values in specific dinner indicators. This consistent trend suggests that these individuals may follow a hidden behavioral mode, engaging in observable campus activities exclusively during evening hours. Their lack of presence across daytime dimensions points to either a disconnection from typical daily routines or potential gaps in behavioral data collection.

These outliers can be broadly categorized into three behavioral types, as summarized in Table 8.

Table 8.

Summary of outlier behavioral types.

Type ID	Type name	Description
A	Globally Active	Characterized by high engagement across all behavior dimensions. May indicate overcommitment; academic monitoring is advised.
B	Fragmented Routine	Marked by missing behavioral records, especially in dining or study activities. This may reflect irregular routines; investigation into time management or dietary patterns is recommended.
C	Evening-Isolated	Shows behavioral traces only during dinner hours, with little or no activity during the day. Such asynchronous patterns warrant interviews to assess psychological and lifestyle factors.

5.5. Association between anomalies and external attributes

To investigate the potential connection between behavioral anomalies and academic performance, we assessed the similarity between the behavioral profiles of students on academic probation and the three previously identified anomalous behavior types. While behavioral abnormality and academic risk are conceptually distinct, overlaps may exist. This analysis aims to explore whether behavioral anomaly profiling could offer supplementary insight for identifying at-risk students.

We computed the Euclidean distance between each of the 12 academically flagged students and the centroid profiles of the three anomaly types, resulting in a similarity matrix shown in Figure 7. For interpretation, higher similarity values indicate closer alignment in behavioral patterns. Rather than applying rigid thresholds, we focused on relative differences in similarity to identify the dominant pattern for each student.
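The matching step can be sketched as follows. The centroid values and the student profile are hypothetical, and the conversion from distance to similarity, $1 / (1 + d)$, is an assumption: the text reports similarity values but does not state how they are derived from the Euclidean distances.

```python
import math

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def closest_type(profile, centroids):
    """Return the best-matching anomaly type and the similarity to every type."""
    distances = {name: euclid(profile, c) for name, c in centroids.items()}
    # Assumed mapping from distance to a similarity in (0, 1].
    similarity = {name: 1.0 / (1.0 + d) for name, d in distances.items()}
    return max(similarity, key=similarity.get), similarity

# Hypothetical standardized centroid profiles for the three anomaly types.
centroids = {
    "A": [2.0, 2.0, 2.0],
    "B": [0.0, -1.0, 0.0],
    "C": [-1.0, -1.0, 2.0],
}
student = [0.2, -0.8, 0.1]
best, sim = closest_type(student, centroids)
```

Comparing the similarities of one student across types, rather than against a fixed cutoff, mirrors the relative interpretation described above.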

Figure 7.

Similarity matrix between academic warning students and abnormal types.

The analysis revealed that 9 out of the 12 students demonstrated a clear proximity to one behavioral anomaly type: six students (ID 1, 2, 4, 7, 11, 12) were most closely aligned with Type B, one student (ID 3) showed the highest similarity to Type A, and two students (ID 8, 9) closely matched Type C. The remaining three students (ID 5, 6, 10) did not exhibit strong alignment with any of the defined types.

These results offer a preliminary validation for the behavioral anomaly typology, indicating that it captures meaningful patterns present in a portion of the academic risk population. While not all students fit within the current classification, the alignment observed in the majority suggests that behavioral data can provide valuable complementary signals for academic risk identification. Future research could further enrich the anomaly taxonomy by incorporating a broader set of at-risk cases and refining behavioral descriptors.

6. Conclusion

In conclusion, BehavGLM provides a powerful and interpretable framework for modeling complex student behaviors from structured campus data. By effectively capturing both latent semantic patterns and inter-student relationships, it achieves notable improvements in clustering performance and enables the detection of meaningful behavioral anomalies. The framework also facilitates deeper insights into student routines, supports early identification of outliers, and provides a foundation for behavior-aware educational decision-making. These results highlight BehavGLM’s potential in large-scale learning analytics systems, with promising applications in dynamic monitoring and personalized student support.

Future work will focus on extending the framework to incorporate real-time behavioral data streams and longitudinal student records. We also plan to explore more diversified graph construction strategies to address the limitations of relying solely on behavioral similarity when building the student graph.

Ethical approval and informed consent statements

Not applicable

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Science and Technology Major Project [grant number 2022ZD0117102]; the National Natural Science Foundation of China [grant numbers 62472014, U21B2038]; the Beijing Natural Science Foundation [grant number 4222021]; and the R&D Program of Beijing Municipal Education Commission [grant number KZ202210005008].

Footnotes

Declaration of conflicting interest

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability

Due to the nature of this research, participants of this study did not agree for their data to be shared publicly, so supporting data is not available.

ORCID iDs

Yujia Guo

Xiaoyong Li

Yisheng Yang

Sufang An

Yong Zhang

References

Ahmad

Iqbal

El-Hassan

, et al. Data-driven artificial intelligence in education: a comprehensive review. IEEE Trans Learn Technol 2024; 17: 12–31.

Chen

Yang

Yuan

, et al. Animating the crowd mirage: a WiFi-positioning-based crowd mobility digital twin for smart campuses. Proc ACM Interact Mob Wearable Ubiquitous Technol 2024; 8: 1–32.

Chen

Xie

Hwang

. A multi-perspective study on artificial intelligence in education: grants, conferences, journals, software tools, institutions, and researchers. Comput Educ Artif Intell 2020; 1: 100005.

Cao

Gao

Lian

, et al. Orderliness predicts academic performance: behavioral analysis on campus lifestyle. J R Soc Interface 2018; 15: 20180224.

Mandalapu

Chen

Shetty

, et al. Student-centric model of learning management system activity and academic performance: from correlation to causation. arXiv preprint arXiv:2210.15430 (2022).

Zhang

, et al. Multi-view hypergraph neural networks for student academic performance prediction. Eng Appl Artif Intell 2022; 114: 105174.

Sridharan

Akilashri

PSS

. Multimodal learning analytics for students behavior prediction using multi-scale dilated deep temporal convolution network with improved chameleon swarm algorithm. Expert Syst Appl 2025; 286: 128113.

Wang

Zeng

, et al. SLBDetection-Net: towards closed-set and open-set student learning behavior detection in smart classroom of k-12 education. Expert Syst Appl 2025; 260: 125392.

Pennisi

. Using trace data to enhance students’ self-regulation: a learning analytics perspective. Internet High Educ 2022; 54: 100855.

10.

Zhao

Song

. Research on the group behavioral characteristics of learners in blended courses based on cluster analysis. In: Proceedings of the 2025 International Conference on Big Data and Informatization Education, 2025, pp.442–446.

11.

Devlin

Chang

Lee

, et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: human language technologies. Association for Computational Linguistics, 2019, pp.4171–4186.

12.

Veličković

Cucurull

Casanova

, et al. Graph attention networks. In: Proceedings of the international conference on learning representations (ICLR), 2018. DOI: 10.48550/arXiv.1710.10903.

13.

Yao

Lian

Cao

, et al. Predicting academic performance for college students: a campus behavior perspective. ACM Trans Intell Syst Technol 2019b; 10: 24:1–24:21.

14. Wang, Guo, et al. Student performance prediction with short-term sequential campus behaviors. Information 2020; 11: 201.

15. Whelan, Islam, Brooks. Applying the SOBC paradigm to explain how social media overload affects academic performance. Comput Educ 2020; 143: 103692.

16. Martinez ALJ, Sood, Mahto. Early detection of at-risk students using machine learning. arXiv preprint arXiv:2412.09483 (2024).

17. Boufaida, Benmachiche, Derdour, et al. TSA-GRU: a novel hybrid deep learning module for learner behavior analytics in MOOCs. Future Internet 2025; 17: 355.

18. Angeioplastis, Aliprantis, Konstantakis, et al. The learning style decoder: FSLSM-guided behavior mapping meets deep neural prediction in LMS settings. Computers 2025; 14: 377.

19. Huang, Khetan, Cvitkovic, et al. TabTransformer: tabular data modeling using contextual embeddings. arXiv preprint arXiv:2012.06678 (2020).

20. Arik SÖ, Pfister. TabNet: attentive interpretable tabular learning. In: Proceedings of the AAAI conference on artificial intelligence. Vol. 35. AAAI Press, 2021, pp.6679–6687.

21. Shukla, Veerasamy, Alduaiji, et al. Encoder only attention-guided transformer framework for accurate and explainable social media fake profile detection. Peer-to-Peer Netw Appl 2025; 18: 232.

22. Shi, Xu, Hua, et al. TabDiff: a mixed-type diffusion model for tabular data generation. In: Proceedings of the international conference on learning representations (ICLR), 2025. DOI: 10.48550/arXiv.2410.20626.

23. Wang, et al. TableGPT2: a large multimodal model with tabular data integration. arXiv preprint arXiv:2411.02059 (2024).

24. Zhang, Liang, et al. TableLLM: enabling tabular data manipulation by LLMs in real office usage scenarios. arXiv preprint arXiv:2403.19318 (2024).

25. Ioannidis, Marques, Giannakis. A recurrent graph neural network for multi-relational data. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP). Brighton, UK, 2019, pp.8157–8161. DOI: 10.1109/ICASSP.2019.8682836.

26. Xia, Huang, Xu, et al. Multi-behavior graph neural networks for recommender system. IEEE Trans Neural Netw Learn Syst 2024; 35: 5473–5487.

27. Zhou, Liu, Chen, et al. Table2Graph: transforming tabular data to unified weighted graph. In: Proceedings of the thirty-first international joint conference on artificial intelligence (IJCAI), 2022, pp.2420–2426.

28. Zheng, Koh, Jin, et al. Correlation-aware spatial–temporal graph learning for multivariate time-series anomaly detection. IEEE Trans Neural Netw Learn Syst 2023; 35: 11802–11816.

29. Chen, Guestrin. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (KDD). ACM, 2016, pp.785–794. DOI: 10.1145/2939672.2939785.

30. Wang, Fu, Fu, et al. Deep & cross network for ad click predictions. In: Proceedings of the KDD workshop on adversarial learning for knowledge discovery (ADKDD’17). ACM, 2017. DOI: 10.1145/3124749.3124754.

31. Cover, Hart. Nearest neighbor pattern classification. IEEE Trans Inf Theory 1967; 13: 21–27.

32. Lloyd. Least squares quantization in PCM. IEEE Trans Inf Theory 1982; 28: 129–137.

33. Reynolds. Gaussian mixture models. In: Encyclopedia of biometrics. New York, NY: Springer, 2009, pp.659–663.

34. Zhang, Ramakrishnan, Livny. BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the ACM SIGMOD international conference on management of data. ACM, 1996, pp.103–114. DOI: 10.1145/233269.233324.

35. Müllner. Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378 (2011). DOI: 10.48550/arXiv.1109.2378.

36. Ng, Jordan, Weiss. On spectral clustering: analysis and an algorithm. In: Advances in neural information processing systems (NeurIPS), Vol. 14. 2002, pp.849–856.

37. Xie, Girshick, Farhadi. Unsupervised deep embedding for clustering analysis. In: International conference on machine learning. PMLR, 2016, pp.478–487.

38. Salton, Yang. On the specification of term values in automatic indexing. J Doc 1973; 29: 351–372.

39. Le, Mikolov. Distributed representations of sentences and documents. In: International conference on machine learning. PMLR, 2014, pp.1188–1196.

40. Pearson. On lines and planes of closest fit to systems of points in space. Philos Mag 1901; 2: 559–572.

41. Breunig, Kriegel, Ng, et al. LOF: identifying density-based local outliers. ACM SIGMOD Record 2000; 29: 93–104.


Table 1. Campus gate access features.

| Feature name | Feature description |
| --- | --- |
| ZJ_JC | Turnstile entry-exit count |
| ZJ_maxt | Maximum entry duration |

Q1. Does the integration of semantic embeddings and graph-based relational modeling enhance the clustering of student behavior profiles?
Q2. Are the resulting clusters behaviorally meaningful, and do they support downstream tasks such as anomaly detection?


Table 5. Clustering performance with different $k$ values in graph construction.

| $k$ | Silhouette score | CH index | DB index | S_Dbw |
| --- | --- | --- | --- | --- |
| 5 | 0.4889 | 15648.84 | 0.8334 | 0.4513 |
| 10 | 0.4903 | 16343.98 | 0.7808 | 0.4150 |
| 15 | 0.5546 | 20615.77 | 0.7207 | 0.4018 |
| 20 | 0.5492 | 19108.14 | 0.7299 | 0.4019 |
| 25 | 0.5393 | 17684.04 | 0.7324 | 0.4106 |
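The four metrics in Table 5 point in different directions, so a small sketch may help when reading it. The snippet below copies the Table 5 values and picks the best $k$ per metric, assuming the usual conventions that higher is better for the silhouette score and CH index and lower is better for the DB index and S_Dbw. The names `table5`, `HIGHER_IS_BETTER`, and `best_k_per_metric` are hypothetical and used only for illustration; this is not the authors' evaluation code.

```python
# Values copied from Table 5; keys are the k used in graph construction.
table5 = {
    5:  {"silhouette": 0.4889, "ch": 15648.84, "db": 0.8334, "s_dbw": 0.4513},
    10: {"silhouette": 0.4903, "ch": 16343.98, "db": 0.7808, "s_dbw": 0.4150},
    15: {"silhouette": 0.5546, "ch": 20615.77, "db": 0.7207, "s_dbw": 0.4018},
    20: {"silhouette": 0.5492, "ch": 19108.14, "db": 0.7299, "s_dbw": 0.4019},
    25: {"silhouette": 0.5393, "ch": 17684.04, "db": 0.7324, "s_dbw": 0.4106},
}

# Assumed metric directions (standard conventions, not stated in the table itself).
HIGHER_IS_BETTER = {"silhouette": True, "ch": True, "db": False, "s_dbw": False}


def best_k_per_metric(table, directions):
    """Return, for each metric, the k whose score is best under its direction."""
    best = {}
    for metric, higher in directions.items():
        pick = max if higher else min
        best[metric] = pick(table, key=lambda k: table[k][metric])
    return best


best = best_k_per_metric(table5, HIGHER_IS_BETTER)
print(best)  # {'silhouette': 15, 'ch': 15, 'db': 15, 's_dbw': 15}
```

Under these assumptions, $k = 15$ is the best setting on all four metrics, although the S_Dbw margin over $k = 20$ (0.4018 versus 0.4019) is very small.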

Table 6. Clustering performance of different embedding methods.

| Method | Silhouette score | CH index | DB index | S_Dbw |
| --- | --- | --- | --- | --- |
| TF-IDF | 0.2997 | 11664.56 | 1.1858 | 0.9808 |
| Doc2Vec | 0.4222 | 11222.53 | 0.9012 | 0.4946 |
| BERT | 0.5546 | 20615.77 | 0.7207 | 0.4018 |