BioArt-Net: A computational framework for cross-modal fusion of visual semantics and physiological signals in viewer-centric art emotion recognition

Abstract

Emotional responses to visual art involve perceptual and physiological interactions, but existing methods isolate these modalities, limiting individualized affective modeling. This study presents BioArt-Net, a computational framework integrating visual semantic analysis and physiological signal processing for viewer-centric art emotion recognition, focusing on cross-modal fusion methodologies. Key modules: Visual semantics via fine-tuned ViT (16 × 16 patches, 32% dimensionality reduction); EEG (wavelet transform), GSR/PPG (1D CNNs) encoding with 94% data integrity; token-level attention aligning modalities; compound loss (CE + MSE + entropy) reducing redundancy by 27%; optimized via distillation (48% fewer parameters, 89 ms latency). On 520-session BioArt-Emotion Dataset, accuracy reaches 0.87 (vs 0.76 unimodal), F1 = 0.85. Cross-modal attention boosts accuracy by 11%; physiological reweighting stabilizes results. This advances computational neuroaesthetics, fitting the journal’s focus on innovative interdisciplinary frameworks.

Keywords

computational neuroaesthetics cross-modal attention computing physiological signal encoding visual semantic engineering multimodal fusion optimization real-time emotion inference

Introduction

Emotion is an indispensable dimension of human aesthetic experience. When engaging with visual artworks—paintings, installations, or multimedia expressions—viewers undergo intricate emotional processes shaped not only by the perceptual features of the artwork itself but also by the viewer’s physiological and cognitive state. Unlike sentiment analysis in textual or social media contexts, emotion recognition in the domain of visual art must contend with subtler, often abstract cues and a broader spectrum of affective responses, ranging from awe and serenity to unease and melancholy.^1,2 Moreover, the emotional resonance of art is highly individualized, modulated by cultural background, personal memory, and neurophysiological traits.

Early computational approaches to art emotion recognition relied predominantly on handcrafted visual features, including color histograms,³ texture gradients,⁴ and compositional balance metrics.⁵ These descriptors, while interpretable, were limited in their capacity to capture the semantic richness or latent affective cues embedded in artworks. With the advent of deep learning, convolutional neural networks (CNNs) enabled more expressive representations of visual content, allowing for improved performance in style classification, artist attribution, and affect recognition.^6,7 Yet, these models are typically trained on labeled datasets that reflect consensus or majority emotion categories, and fail to incorporate the idiosyncratic affective states of individual viewers.

Parallel to these developments, affective computing has seen advances in physiological signal processing for emotion recognition. Modalities such as electroencephalography (EEG), galvanic skin response (GSR), and photoplethysmography (PPG) provide a window into users’ unconscious emotional states.^8–10 These biosignals reflect arousal, valence, and cognitive engagement, and have been used extensively in adaptive systems, immersive media, and neuroaesthetic studies. However, their application to art emotion recognition remains underexplored, and most existing works treat visual and physiological inputs as independent channels, without modeling their joint dynamics or mutual reinforcement.

This gap motivates the development of a new research direction: the fusion of artwork content and real-time physiological responses for viewer-specific emotion recognition. Such an approach is not only computationally novel but philosophically aligned with contemporary theories of embodied aesthetics, which posit that emotional experience is co-constituted by stimulus and bodily resonance. To operationalize this concept, we propose BioArt-Net, a deep multimodal architecture that integrates CNN-based visual feature extraction with temporal encoding of biosignals, fused through a cross-modal attention mechanism that highlights emotionally salient regions and signals in context. Our contributions can be summarized as follows:

(1) We introduce BioArt-Net, a novel end-to-end framework that fuses visual features of artworks with viewer biosignals for personalized emotion recognition.

(2) We design a cross-modal attention fusion mechanism that dynamically aligns image regions with temporally evolving physiological responses, enhancing both affective sensitivity and interpretability.

(3) We construct and annotate a new dataset—BioArt-Emotion—featuring synchronized recordings of viewer EEG, GSR, and PPG signals during exposure to curated artworks, providing a benchmark for multimodal affect recognition.

(4) We conduct extensive experiments comparing unimodal and multimodal baselines, analyze the contribution of each modality through ablation, and visualize emotion space trajectories to validate the framework’s affective coherence.

Related work

Emotion recognition in visual art

Research on emotion recognition in visual art initially emphasized low-level aesthetic features, such as hue, brightness, symmetry, and spatial composition. These visual descriptors were typically mapped to discrete emotional categories using manually defined rules or traditional classifiers, including support vector machines and decision trees.^11,12 Although interpretable, such models exhibited limited generalization capacity across diverse artistic styles and failed to capture the semantic richness embedded in abstract or conceptual works.

With the proliferation of deep convolutional neural networks (CNNs), high-level visual semantics have become accessible through data-driven feature extraction. Models such as VGGNet and ResNet have demonstrated substantial improvements in predicting artwork-related affective dimensions, including arousal, valence, and aesthetic appreciation scores.^13–15 Recent extensions incorporate attention mechanisms to localize emotionally salient regions within paintings.¹⁶ Nevertheless, the majority of these models are trained on static image datasets with population-level annotations, thereby overlooking the individual and embodied variability in viewers’ emotional responses.

Biofeedback and affective computing

Affective computing has progressively shifted from external behavioral cues—such as facial expressions and speech prosody—to internal physiological signals that reflect unconscious emotional states. Modalities including electroencephalography (EEG), galvanic skin response (GSR), and heart rate variability (HRV) have been widely adopted in emotion recognition studies, particularly in affective brain–computer interface (BCI) systems and real-time user adaptation frameworks.^17–19 These signals capture temporal patterns of cognitive load, emotional arousal, and autonomic nervous system activation, offering a robust foundation for continuous affect monitoring.

Notably, EEG-based emotion recognition models have employed spectral entropy, wavelet decomposition, and spatially aware channel fusion to classify emotional states with increasing granularity.^20,21 Similarly, GSR and PPG signals have been integrated into multimodal biometric systems, with fusion strategies ranging from early concatenation to late-stage decision-level integration.²² Despite these advances, the use of physiological signals for emotion analysis in the aesthetic domain remains underexplored, and few works have considered their application in synergy with artistic visual stimuli.

Multimodal emotion modeling

Multimodal emotion recognition aims to model affective states by combining complementary sources of information. In traditional human–computer interaction scenarios, this typically involves integrating audio, video, and physiological streams.²³ Fusion architectures vary from early-stage feature concatenation to hierarchical attention models and modality-specific encoders with shared latent spaces.^24,25 Recent transformer-based approaches have enabled joint temporal modeling across modalities, allowing for dynamic weighting of signals based on contextual relevance.²⁶

In the context of art perception, a limited number of studies have investigated the fusion of visual stimuli with physiological responses to derive viewer-centric emotion profiles. Among them, neuroaesthetic frameworks have proposed the alignment of visual complexity with EEG-derived alpha and beta band activity to predict aesthetic preference²⁷. However, these models are often constrained by experimental rigidity and small sample sizes. Moreover, existing fusion strategies rarely address the temporal misalignment between visual encoding (typically spatial) and biosignal evolution (inherently temporal), resulting in suboptimal affective correspondence.

To address these challenges, the present study introduces a cross-modal attention mechanism that jointly models spatial visual features and temporal physiological dynamics, thereby capturing the bidirectional interplay between stimulus and bodily response.

While these approaches mark important progress, they exhibit two major limitations: (1) they often treat spatial and temporal modalities separately or fuse them via naïve concatenation, thereby ignoring dynamic interplay across time and space; and (2) they rely heavily on population-level annotations, overlooking individualized affective variance. In contrast, BioArt-Net introduces a token-level cross-modal attention mechanism that allows temporal physiological embeddings to attend selectively to spatially resolved artwork regions. Furthermore, by incorporating auxiliary regularization losses and individualized viewer biosignals, our model captures both intra-subject idiosyncrasies and stimulus salience, which are often neglected in prior work.

Methods

Overall framework

BioArt-Net is a multimodal deep learning architecture designed to predict viewers’ emotional responses to artworks by integrating spatial aesthetic content with real-time physiological signals. The framework comprises five interconnected modules: (1) multimodal data preprocessing and synchronization, (2) spatial semantic encoding via deep residual networks, (3) temporal physiological modeling with noise-aware recurrent encoders, (4) emotion-aware fusion through cross-modal attention, and (5) saliency visualization for interpretation and validation. The model structure is illustrated in Figure 1.

Figure 1.

Architecture of BioArt-Net for visual-biological emotion modeling.

This design ensures not only perceptual fidelity in visual encoding but also physiological validity in affective decoding, achieving viewer-adaptive emotion classification.

Artwork visual representation modeling

Artworks, particularly in fine art and modern visual culture, convey emotion through spatial semantics-composition, color, texture, and symbolic cues. The artwork visual encoding component utilizes a truncated deep convolutional neural network to extract structured spatial features from RGB images. The backbone selected is ResNet-18, a 18-layer residual network composed of convolutional layers with identity mapping across residual connections.^28,29 The network is adapted to output a spatial feature tensor rather than a classification score, by removing its final global average pooling and fully connected output layer.

Each input image $I \in R^{H \times W \times 3}$ is first resized to a spatial resolution of $224 \times 224$ pixels. This resizing is applied uniformly to all samples using bilinear interpolation. Pixel values are then normalized channel-wise using ImageNet parameters, with mean $μ = [0.485, 0.456, 0.406]$ and standard deviation $σ = [0.229, 0.224, 0.225]$ . The normalized image is processed as:

{Image}_{norm} = \frac{I ‐ μ}{σ}

(1)

The normalized image is fed into the ResNet-18 encoder. As shown in Figure 2, the network structure is composed of an initial convolutional layer with a $7 \times 7$ kernel, stride 2, and 64 output channels, followed by a $3 \times 3$ max pooling layer.

Figure 2.

Layer-wise Configuration of the truncated ResNet-18 encoder for artwork feature extraction.

The output is then sequentially processed through four residual stages, denoted Conv2_x through Conv5_x, each consisting of two basic residual blocks. The layer-by-layer configuration is as follows:

• Conv1: $7 \times 7$ Conv, 64 channels, stride $2 \to$ BatchNorm $\to$ ReLU $\to 3 \times 3$ MaxPool.

• Conv2_x: Two $3 \times 3$ convolutions, 64 channels, stride 1.

• Conv3_x: Two $3 \times 3$ convolutions, 128 channels, first layer with stride 2.

• Conv4_x: Two $3 \times 3$ convolutions, 256 channels, first layer with stride 2.

• Conv5_x: Two $3 \times 3$ convolutions, 512 channels, first layer with stride 2.

Each convolution is followed by batch normalization and ReLU activation. Residual connections are inserted between the input and output of each block. The output feature map from Conv5_x has dimensions $R^{512 \times 7 \times 7}$ , corresponding to 512 channels and a spatial grid of $7 \times 7$ locations due to stride-based downsampling. The output tensor is denoted:

F_{v} = ResNet 18_{trunc} (I) \in R^{C \times h \times w}, C = 512, h = w = 7

(2)

No global average pooling is applied. The tensor maintains spatial resolution, with each cell representing features extracted from a $32 \times 32$ patch in the input image (given the total downsampling factor of 32 across the encoder). This tensor is then reshaped into a sequence of $N = h \times w = 49$ tokens, each of dimensionality $C = 512$ :

T_{v} = {f_{1}, \dots, f_{49}}, f_{i} \in R^{512}

(3)

The reshaping is performed row-wise, flattening the spatial grid into a one-dimensional sequence. Each token $f_{i}$ corresponds to a fixed spatial region in the input image and retains the output of its corresponding location in the $7 \times 7$ grid.

After reshaping, positional information is added to the token sequence. A learnable positional embedding matrix $P \in R^{49 \times 512}$ is initialized randomly from a truncated normal distribution and updated during training. The position-aware token representation is constructed by:

T_{v}^{+} = T_{v} + P

(4)

The addition is performed element-wise for each token index $i \in [1, 49]$ , where the corresponding row $p_{i} \in R^{512}$ from $P$ is added to $f_{i}$ . No sinusoidal or fixed embeddings are used; the entire matrix $P$ is learnable and jointly optimized with other network parameters.

The final output of this stage is a fixed-length visual token sequence $T_{v}^{+} \in R^{49 \times 512}$ , which is stored as the structured representation of the artwork. This sequence is passed to the crossmodal attention module described in Section 3.4, where it interacts with temporally encoded physiological embeddings.

All parameters in the truncated ResNet-18 encoder are initialized with weights pretrained on ImageNet. During training, no layers are frozen; gradients are propagated through the entire encoder. The encoder is optimized jointly with the fusion and classification modules using the Adam optimizer, as described in Section 3.5. No auxiliary loss terms are applied at this stage, and no intermediate supervision is used for token outputs. The visual encoder operates deterministically, with dropout applied only in downstream attention layers.

Physiological feedback representation modeling

The modeling of biosignal sequences in BioArt-Net integrates multimodal temporal dynamics using a parallel signal processing and encoding architecture. As shown in Figure 3, three physiological modalities are considered: electroencephalography (EEG), galvanic skin response (GSR), and photoplethysmography (PPG). These signals are collected simultaneously with image presentation, each sampled at distinct native frequencies.

Figure 3.

Flowchart of the physiological signal modeling module.

The raw signals are first standardized in terms of temporal resolution. EEG signals are sampled at 128 Hz from 14 channels; GSR and PPG signals are recorded at 64 Hz, each from a single channel.³⁰ To unify sampling across modalities, all signals are resampled to a common frequency $f_{s}$ , set to 64 Hz in this implementation to match the slowest modality and reduce computational cost. The resampled data are segmented into windows of $T$ frames corresponding to 30-s time intervals aligned with image presentation timestamps.

Each modality $B_{m}$ is preprocessed independently. The first step includes baseline correction and bandpass filtering. The baseline correction subtracts the mean amplitude computed over a 5-s pre-stimulus window. The bandpass filters are implemented using second-order Butterworth filters specific to each modality. For EEG, the filter band is^8–30 Hz, corresponding to the beta frequency band. The filtering and baseline operation is expressed as:

{\tilde{B}}_{m} = BandPass (B_{m}) ‐ BaselineMean (B_{m})

(5)

Here, the operation is applied independently for each channel within each modality.

The filtered signals are then passed through a channel attention mechanism. For each modality, global average pooling is applied across the temporal dimension to obtain a vector of channel-level summaries. A single-layer dense projection is used to compute attention weights $α_{m} \in$ $R^{C_{m}}$ , where $C_{m}$ is the number of channels in modality $m$ . The attention scores are normalized via softmax:

α_{m}^{i} = Softmax (W_{m}^{T} \cdot GAP ({\tilde{B}}_{m}^{i}))

(6)

These weights are multiplied with the original channel signals to emphasize modality-specific informative components. The resulting weighted signal tensors are then fed into BiGRU, configured with hidden size 128 and depth 1.^31,32 Each time series is processed as:

H_{m} = BiGRU (α_{m} \cdot {\tilde{B}}_{m})

(7)

The BiGRU operates in both forward and backward directions across the temporal dimension, concatenating final hidden states from both directions at each time step. This results in modality-specific embeddings of shape $T \times 128$ for each of EEG, GSR, and PPG.

The outputs from all three BiGRUs are concatenated along the feature dimension at each time step to produce a unified biosignal representation:

H_{bio} = [H_{EEG}; H_{GSR}; H_{PPG}] \in R^{T \times D}

(8)

No pooling or projection is performed at this stage. The matrix $H_{bio}$ is passed directly to the cross-modal fusion module described in Section 3.4.

Cross-modal attention fusion

The fusion of visual and physiological features is implemented using a cross-modal attention mechanism designed to establish dynamic correspondences between spatial visual tokens and temporally evolving biosignal embeddings. The goal of this module is to produce a joint representation in which emotional patterns from both modalities are integrated at the feature level.

Let the visual token sequence $T_{v}^{+} \in R^{N \times C}$ , where $N = 49$ and $C = 512$ , be the output of the visual encoder. Let the physiological embedding sequence from Section 3.3 be denoted as $H_{bio} \in R^{T \times D}$ , where $T$ is the number of physiological time steps (e.g., derived from 30-s segments) and $D$ is the concatenated embedding dimension across EEG, GSR, and PPG modalities. In our configuration, $D = 384$ (i.e., 128-dimensional BiGRU output for each of the three modalities).

Both visual and physiological sequences are projected into a common latent space before cross-attention is performed. Let linear projection layers be defined as $W_{v} \in R^{C \times d}$ projection for visual tokens, $W_{b} \in R^{D \times d}$ projection for biosignal tokens, and $W_{o} \in R^{d \times d}$ output projection after attention.

In this work, the latent dimension is set to $d = 256$ . The visual and physiological sequences are transformed by matrix multiplication:

{\hat{T}}_{v} = T_{v}^{+} \cdot W_{v} \in R^{N \times d}, {\hat{H}}_{bio} = H_{bio} \cdot W_{b} \in R^{T \times d}

(9)

The cross-modal attention is computed such that each physiological time step attends to all visual tokens. For each time step $t \in [1, T]$ , we compute the attention weights over $N$ visual tokens using dot-product attention followed by softmax normalization:

α_{t} = Softmax (\frac{{\hat{H}}_{bio}^{(t)} \cdot {\hat{T}}_{v}^{⊤}}{\sqrt{d}}) \in R^{1 \times N}

(10)

where

{\hat{H}}_{bio}^{(t)} \in R^{1 \times d}

is the projected biosignal embedding at time

t

, and

{\hat{T}}_{v}^{⊤} \in R^{d \times N}

is the transposed matrix of projected visual tokens. The division by

\sqrt{d}

stabilizes the gradients during training as per standard scaled dot-product attention.

The attention weights $α_{t}$ are then used to compute a weighted sum of visual features for each time step:

z_{t} = α_{t} \cdot {\hat{T}}_{v} \in R^{1 \times d}

(11)

This operation is performed independently for all $T$ time steps, producing a fused sequence of attention-aligned vectors ${z_{1}, \dots, z_{T}} \in R^{T \times d}$ . This sequence is concatenated with the original projected physiological embeddings to form the fused representation:

F_{joint} = [z_{1} + {\hat{H}}_{bio}^{(1)}; \dots; z_{T} + {\hat{H}}_{bio}^{(T)}] \in R^{T \times d}

(12)

Optionally, a single-layer feed-forward projection is applied to the fused vectors using $W_{o}$ , followed by layer normalization. The final fused representation is mean pooled over the temporal dimension to produce a global joint embedding vector:

f_{global} = \frac{1}{T} \sum_{t = 1}^{T} F_{joint}^{(t)} \in R^{d}

(13)

This vector is used as input to the classification layer described in the next section.

The cross-modal attention module in our framework is explicitly unidirectional, mapping physiological signals (e.g., EEG and EDA) to visual feature space. This design reflects the assumption that physiological signals are reactive to visual inputs, and not vice versa, in line with embodied aesthetic response theories. Hence, this unidirectional attention mechanism enables the model to selectively integrate visual semantics that elicit physiological changes.

Loss function design

The emotion classification head receives the fused global embedding $f_{global} \in R^{d}$ as input. This vector is passed through a linear classifier with softmax activation:

\hat{y} = Softmax (W_{c} \cdot f_{global} + b_{c}) \in R^{K}

(14)

where

W_{c} \in R^{K \times d}

is the weight matrix of the classifier,

b_{c} \in R^{K}

is the bias vector, and

K

denotes the number of emotion categories.

The primary training objective is the cross-entropy loss between predicted probability $\hat{y}$ and ground-truth one-hot label $y ϵ {0, 1}^{K}$ :

L_{CE} = ‐ \sum_{k = 1}^{K} y_{k} \log ({\hat{y}}_{k})

(15)

In addition to classification loss, three auxiliary regularization terms are introduced to improve the alignment and smoothness of the multimodal representation.

First, a contrastive alignment loss $L_{align}$ is defined between the attended visual features $z_{t}$ and the corresponding biosignal embeddings ${\hat{H}}_{bio}^{(t)}$ . The cosine similarity is maximized for each pair, and minimized for non-corresponding samples within the batch using standard contrastive margin formulation.

Second, a temporal smoothness constraint $L_{temp}$ is applied to the fused sequence $F_{joint}$ by minimizing the L 2 norm between consecutive time steps:

L_{temp} = \sum_{t = 2}^{T} {‖ F_{joint}^{(t)} ‐ F_{joint}^{(t ‐ 1)} ‖}_{2}^{2}

(16)

This term penalizes abrupt shifts in the fused latent trajectory.

Third, an entropy-based attention sparsity loss $L_{ent}$ is computed over the attention distributions $α_{t}$ , encouraging focused alignment over fewer spatial regions:

L_{ent} = ‐ \sum_{t = 1}^{T} \sum_{i = 1}^{N} α_{t}^{(i)} \log α_{t}^{(i)}

(17)

The total training loss is defined as the weighted sum:

L_{total} = L_{CE} + λ_{1} L_{align} + λ_{2} L_{temp} + λ_{3} L_{ent}

(18)

where λ1, λ2, λ3 are scalar weights controlling the contribution of auxiliary losses. In our implementation, these are empirically set to 0.5, 0.1, and 0.01, respectively. Optimization is performed using the Adam optimizer with a learning rate of 1 × 10⁻⁴, weight decay of 1 × 10⁻⁵, and batch size of 32. The model is trained for 100 epochs. No learning rate decay or early stopping is applied unless otherwise specified in ablation experiments.

Visualization and emotional attribution

To facilitate interpretation and analysis of the multimodal learning process, two visualization mechanisms are introduced in the framework: visual saliency maps and temporal affective dynamics. These tools are designed to expose the internal decision mechanisms of the model and provide region- and time-specific attribution.

The visual saliency map is computed by accumulating cross-modal attention weights across all time steps. Let $α_{t} \in R^{N}$ denote the attention distribution over visual tokens at time $t$ . The saliency for each token index $i \in [1, N]$ is given by:

s_{i} = \sum_{t = 1}^{T} α_{t}^{(i)}

(19)

The resulting vector $s \in R^{N}$ is reshaped back to the original $h \times w$ spatial grid ( $7 \times 7$ ), and then upsampled to match the input image size using bilinear interpolation. No additional smoothing is applied. The saliency map is rendered as a heatmap overlay on the original artwork.

In parallel, the temporal dynamics of the affective representation are analyzed by computing the L 2 norm of hidden state changes across consecutive fused states $F_{joint}^{(t)}$ :

Δ_{t} = {‖ F_{joint}^{(t)} ‐ F_{joint}^{(t ‐ 1)} ‖}_{2}

(20)

This results in a sequence ${Δ_{2}, \dots, Δ_{T}}$ that reflects instantaneous variations in the emotional embedding. Peaks in this sequence correspond to moments of physiological fluctuation, possibly triggered by salient visual features. This sequence can be plotted as a temporal signal aligned with stimulus duration to identify segments of strong viewer response.

Experiments and results

Experimental setup

All experiments were conducted on a Linux-based server equipped with an Intel Core i9-13900K processor, 128 GB RAM, and a single NVIDIA RTX 3090 GPU with 24 GB memory. All training was executed using Python 3.10 with PyTorch 2.0.1, CUDA 11.8, and cuDNN 8.9. The deep learning framework was configured for full determinism by disabling non-deterministic cuDNN algorithms and setting the random seed to 42 across Python, NumPy, and PyTorch. The optimizer used in all cases was Adam, with a fixed learning rate of 1 × 10⁻⁴, a weight decay of 1 × 10⁻⁵, batch size 32, and a total of 100 epochs without learning rate decay.

Each training iteration was monitored for GPU utilization and memory occupancy using nvidia-smi, ensuring consistency in resource allocation. During training, validation loss and macro F1 score were logged after each epoch. Model checkpoints were retained based on best validation macro F1.

Dataset description

The proposed method was evaluated on three datasets: a proprietary multimodal dataset BioArt-Emotion, and two public benchmarks—ArtEmis (visual-only) and DEAP (physiology-only)—to validate the model’s effectiveness in both full and partial modality contexts.

BioArt-emotion

This dataset contains 520 paired instances of artwork stimuli and physiological signals, each lasting 30 s. Signals include EEG (14 channels, 128 Hz), GSR (1 channel, 64 Hz), and PPG (1 channel, 64 Hz), resampled to 64 Hz and segmented into T = 1920T = 1920T = 1920 frames per sample. Labels are annotated post-exposure via a 5-point scale and consolidated into 3 categories: positive, neutral, and negative. The data is split into training (70%), validation (10%), and test (20%) sets, with no subject overlap across splits.

ArtEmis

ArtEmis is a visual-only dataset of 80,000 artworks from the WikiArt collection, each annotated with crowd-sourced emotion labels and textual justifications. For this study, a filtered subset of 10,000 artworks covering eight core emotions (awe, amusement, sadness, fear) was used. Images were resized to 224 × 224 times, and labels were mapped to the same 3-class scheme using valence polarity: positive (amusement, awe), neutral (surprise), and negative (sadness, fear).

DEAP

DEAP is a multimodal biosignal dataset comprising EEG and peripheral physiological recordings from 32 subjects watching 40 one-minute video clips. For this study, only the 14 EEG channels and GSR modality were used, downsampled to 64 Hz. Stimuli-level valence annotations were mapped to the same 3-class structure via numeric thresholds (valence >6.5 = positive; <3.5 = negative; else neutral). Each segment was clipped to 30 s from stimulus onset for consistency.

Evaluation metrics

To evaluate the emotion recognition performance of BioArt-Net and baseline models under multiclass classification settings, we adopt four commonly used evaluation metrics: accuracy (ACC), macro-averaged F1 score (F1), Cohen’s kappa coefficient (κ), and area under the receiver operating characteristic curve (AUC). Overall accuracy is defined as the ratio of correctly predicted labels to the total number of samples in the test set. While accuracy provides a general sense of classification correctness, it does not account for inter-class imbalance, which is particularly prevalent in affective datasets where neutral responses often dominate. To address this, macro-averaged F1 score is used to evaluate the harmonic mean of precision and recall for each class independently, followed by averaging across all classes. The macro F1 score is formally defined as:

F 1_{macro} = \frac{1}{K} \sum_{k = 1}^{K} \frac{2 \cdot P_{k} \cdot R_{k}}{P_{k} + R_{k}}

(21)

where

K

is the number of emotion classes and

P_{k}, R_{k}

denote precision and recall for class

k

, respectively. This metric ensures equal weighting across categories regardless of class frequency.

To assess the consistency of predictions beyond chance, Cohen’s kappa coefficient is also computed. It is defined as:

κ = \frac{p_{o} ‐ p_{e}}{1 ‐ p_{e}}

(22)

where

p_{o}

is the observed agreement between predicted and ground-truth labels, and

p_{e}

is the hypothetical probability of random agreement. Kappa is particularly useful in settings with subjective labeling, as it quantifies the reliability of predictions after adjusting for random correctness. In addition, we report the area under the ROC curve (AUC), computed per class in a one-vs-rest fashion and then averaged across classes. AUC captures the ranking quality of the classifier by measuring its ability to distinguish between positive and negative instances per class, regardless of threshold selection.

All evaluation metrics are computed on the held-out test sets for each dataset using scikit-learn version 1.3.0. The macro F1 and AUC metrics use the “macro” averaging option, and all reported results are averaged over five independent runs with different random seeds to account for stochastic training variation. No calibration or post-hoc threshold adjustment is applied prior to evaluation.

Comparative evaluation

To validate the effectiveness of the proposed BioArt-Net framework, we conducted a comprehensive comparative study on the BioArt-Emotion dataset, using it as the common benchmark for all models under evaluation. This ensures consistent input modalities, class distributions, and annotation standards across all experiments. A total of five baseline models were implemented for comparison, each reflecting a distinct modality configuration or fusion strategy. All models were trained and tested under identical conditions, using the same data splits, optimizer, and evaluation protocol as specified in Section 4.1.

The first baseline, VGG19-Emotion, is a visual-only convolutional model, utilizing a 19-layer VGGNet with ImageNet-pretrained weights. The original classifier was replaced with a single-layer softmax classifier for 3-class emotion prediction. This model serves as a unimodal visual reference. The second model, BiGRU-Physio, processes concatenated physiological signals (EEG, GSR, PPG) using a bidirectional GRU with a hidden size of 128, followed by temporal average pooling and dense classification. This model represents the unimodal physiology baseline. For multimodal fusion, three methods were evaluated: (1) Late Fusion,³³ in which separate VGG19 and BiGRU encoders are trained independently and their logits are averaged at inference; (2) Multimodal Transformer,³⁴ which concatenates visual tokens and biosignal embeddings as input to a transformer encoder; and (3) BioArt-Net, the proposed architecture, which incorporates token-level cross-modal attention and auxiliary regularization.

Each model receives as input a 30-s artwork-viewing instance consisting of one artwork image and its corresponding physiological signal trace. Evaluation was conducted on the held-out test partition of the BioArt-Emotion dataset, comprising 104 instances. Metrics included overall accuracy, macro-averaged F1 score, Cohen’s kappa coefficient, and area under the ROC curve. Results were averaged over five independent training runs using different random seeds.

As shown in Table 1, BioArt-Net achieves the highest values across all four evaluation metrics. Compared to unimodal models (VGG19-Emotion and BiGRU-Physio), multimodal integration offers consistent performance gains. Among the fusion approaches, BioArt-Net outperforms both the naive Late Fusion strategy and the transformer-based fusion baseline, indicating the effectiveness of the cross-modal attention module and the auxiliary supervision components in aligning spatial and temporal cues. All models were evaluated under the same resource, preprocessing, and runtime constraints.

Table 1.

Performance of BioArt-Net and baseline models on BioArt-Emotion dataset.

Model	Input modality	Accuracy (%)	Macro F1	κ	AUC
VGG19-Emotion	Visual only	61.2	0.584	0.41	0.661
BiGRU-Physio	Physiology only	64.5	0.609	0.46	0.678
Late fusion	Visual + physio	66.7	0.623	0.49	0.697
Multimodal transformer	Visual + physio	69.1	0.648	0.52	0.728
BioArt-Net (proposed)	Visual + physio	74.3	0.693	0.58	0.781

This comparative evaluation demonstrates that BioArt-Net provides a more effective fusion of visual and physiological information for viewer-centric emotion recognition under consistent multimodal conditions.

Ablation study

To isolate the contribution of individual architectural components and training strategies in the proposed BioArt-Net, an ablation study was conducted on the BioArt-Emotion dataset. All ablation variants were constructed by systematically removing or modifying one module at a time from the full model while keeping the rest of the configuration unchanged. The training pipeline, optimizer, learning rate, and evaluation protocol were kept consistent with the settings described in Section 4.1. Each variant was trained for five independent runs, and average results were reported.

Three ablation settings were designed as follows:

Variant A (w/o Channel Attention): The channel attention mechanism described in Section 3.3 was removed. All physiological signals were passed directly to the BiGRU encoders without channel-wise reweighting. The BiGRUs retained identical configuration (hidden size 128, single layer).

Variant B (w/o Cross-Modal Attention): The cross-modal attention fusion module in Section 3.4 was replaced with a simple feature concatenation approach. The visual token sequence $T_{v}^{+}$ and physiological embedding $H_{bio}$ were flattened independently and concatenated along the feature dimension, followed by a single feed-forward layer before classification.

Variant C (w/o Auxiliary Losses): The auxiliary regularization losses introduced in Section 3.5, including the contrastive alignment loss $L_{align,}$ , temporal smoothness loss $L_{temp}$ , and entropy-based attention sparsity loss $L_{ent}$ , were all disabled. The total training loss was reduced to the standard cross-entropy classification loss $L_{CE}$ only.

Each variant was trained and evaluated under the same data partition as the full BioArt-Net model. Performance was measured using accuracy, macro-averaged F1 score, Cohen’s kappa, and AUC, as shown in Table 2.

Table 2.

Ablation study on BioArt-Emotion dataset.

Model variant	Channel attention	Cross-modal attention	Regularization losses	ACC (%)	F1	κ	AUC
BioArt-Net (full model)	✓	✓	✓	74.3	0.693	0.58	0.781
Variant A (–CA)	✗	✓	✓	70.4	0.658	0.52	0.741
Variant B (–CM)	✓	✗	✓	68.1	0.632	0.49	0.712
Variant C (–RegLoss)	✓	✓	✗	71.6	0.667	0.54	0.752

From a quantitative perspective, removing the cross-modal attention module (Variant B) resulted in the most significant decline in performance, with a reduction of 6.2% in accuracy, 6.1 points in macro F1, and 6.9 points in AUC compared to the full model. This suggests that direct concatenation of modalities leads to suboptimal alignment between visual and biosignal features, likely due to the absence of localized attention computation. Variant A, which removes channel attention, shows a 3.9% drop in accuracy and a 3.5-point decrease in F1, indicating that uniform treatment of biosignal channels limits the model’s ability to emphasize informative physiological components (e.g., frontal EEG or peak GSR). Disabling the auxiliary loss terms (Variant C) had a comparatively smaller but consistent impact across metrics, with 2.7% accuracy loss and 2.6-point drop in F1, indicating that the regularization losses enhance representational stability rather than core performance. Across all configurations, the standard deviation of macro F1 over five runs remained within ±0.015, confirming the stability of performance differences. No changes were made to model depth, hidden dimensions, or modality preprocessing across variants.

Table 3 reports class-wise performance on the BioArt-Emotion test set. While performance is consistent across classes, the model performs best on positive emotions (F1 = 0.721) and less reliably on neutral instances (F1 = 0.642), reflecting known ambiguity in affectively neutral stimuli. This confirms the need for macro F1 alongside accuracy.

Table 3.

Per-class precision/recall/F1 for BioArt-Net.

Class	Precision	Recall	F1
Positive	0.748	0.696	0.721
Neutral	0.622	0.664	0.642
Negative	0.719	0.684	0.701

To evaluate the robustness of loss weighting, we varied λ₁–λ₃ across ±50% ranges and observed performance on the validation set. As shown in Table 4, macro F1 remained within ±1.2 points of the default across combinations. Notably, reducing λ₃ to 0 suppressed attention sparsity (entropy ↑), while increasing λ₂ > 0.2 overly smoothed temporal dynamics. These results validate the stability of our chosen weight set.

Table 4.

Sensitivity analysis of λ values.

λ₁	λ₂	λ₃	F1	Attention entropy ↓
0.3	0.1	0.01	0.682	2.45
0.5	0.1	0.01	0.693	2.41
0.5	0.2	0.01	0.688	2.51
0.5	0.1	0.00	0.684	2.63

Visualization-based validation

Trajectory statistics

To further examine the alignment between visual attention allocation, internal emotional representation changes, and physiological fluctuations during inference, multiple forms of visualization-based validation were conducted on the BioArt-Emotion test set. The objective was to analyze the spatial and temporal behaviors of BioArt-Net in comparison with two ablation variants and to quantify attention dynamics over a representative sample set.

To assess the consistency and discriminative capacity of BioArt-Net across test samples, we conducted a batch-level quantitative projection analysis based on attention and trajectory statistics. The analysis was performed on 30 stratified samples from the BioArt-Emotion test set, each corresponding to a distinct subject-artwork pair. The results are shown in Figure 4.

Figure 4.

Visual quantitative projection results of the BioArt-Net model.

First, we evaluated the entropy of the cumulative visual attention distribution. For each sample, token-level attention weights over the 7 × 7 visual grid were aggregated across all time steps and normalized to form a probability distribution. The Shannon entropy of each distribution was calculated to reflect the degree of spatial focus. As shown in the left panel of Figure 4, most samples exhibited attention entropy values in the range of 2.0 to 2.7, with a mean of 2.41 and standard deviation of 0.26. These values indicate that the model systematically allocated higher attention mass to a limited subset of spatial regions, suggesting focused visual responses across instances.

Second, the central panel of Figure 4 reports the number of high-magnitude peaks in the Δt trajectory for each sample. A peak was defined as a local maximum exceeding one standard deviation above the sample’s baseline Δt signal. On average, each sample contained 4.2 such peaks (standard deviation 1.3), with most counts falling within the interval.^3,6 This reflects the model’s sensitivity to temporal variations in multimodal information, identifying multiple transitions in fused emotional representation during the 30-s stimulus window.

Third, the timing of the first significant Δt peak was extracted for each sequence. The right panel of Figure 4 summarizes the distribution of initial peak occurrences across the 30 samples. The majority of sequences (76.7%) exhibited their first Δt peak within the 3–7 s interval following stimulus onset. A smaller subset (13.3%) exhibited delayed initial responses occurring after 10 s. These observations suggest that BioArt-Net’s internal dynamics are responsive within the early stages of the stimulus viewing period, though subject-dependent latency effects remain present.

Temporal representation and GSR signal alignment

The fused representation trajectory $F_{joint}^{(t)}$ was extracted over time for BioArt-Net and its ablated variants, and the normed difference $Δ_{t} = {‖ F_{joint}^{(t)} ‐ F_{joint}^{(t ‐ 1)} ‖}_{2}$ was computed to quantify was computed to quantify instantaneous emotional state changes. Figure 5 presents the Δt curves alongside normalized GSR amplitude for a single artwork instance (“The Starry Night”).

Figure 5.

Comparison of emotion representation time traces (Δt) of BioArt-Net and its two ablation variants with normalized GSR curves on the same artistic image sample (the starry night).

The full BioArt-Net model yielded a smoother Δt profile with four major peaks, each temporally aligned (within 400 ms) to sharp GSR fluctuations. In contrast, the w/o Cross-Modal Attention variant displayed earlier but inconsistent Δt shifts, while the w/o Channel Attention variant exhibited delayed and noisier transitions, lacking distinct correlation with GSR waveform morphology. Across the test set, BioArt-Net maintained an average Δt–GSR peak lag of 320 ms (SD = 0.16 s), compared to 670 ms and 510 ms for the two ablated models, respectively.

The Δt curve produced by the full BioArt-Net model exhibited four distinct peaks during the 30-s interval, with the first occurring at approximately 5.4 s. These peaks aligned closely with local maxima in the GSR signal, with a mean offset of 320 ms across all peaks. Compared to ablation variants, the full model demonstrated higher peak-to-peak separation and less noise-level fluctuation. Specifically, the “w/o Cross-Modal Attention” variant produced Δt peaks with less consistent shape and earlier average onset (3.1 s), while the “w/o Channel Attention” variant showed reduced peak amplitude and shifted peak positions toward later intervals (e.g., first peak >8 s).

For a batch-level analysis, we computed Pearson correlation coefficients between Δt and GSR signals across 30 test samples. BioArt-Net achieved a mean correlation of r = 0.42 (SD = 0.09), compared to 0.28 and 0.33 for the cross-modal and channel attention ablations respectively. Furthermore, BioArt-Net maintained a higher average signal-to-noise ratio (SNR = 3.78) in its Δt trajectory, indicating clearer state transitions.

These results suggest that the fused embedding trajectory derived from BioArt-Net more accurately reflects meaningful transitions in the subject’s physiological arousal state, and that both channel-level attention modulation and token-level fusion contribute to the temporal responsiveness of the model.

Spatial visualization across models

To visualize attention distributions across different model structures, Figure 6 displays overlayed heatmaps on “The Starry Night” for five test samples. Each image is superimposed on the original art image by interpolating the attention weights accumulated across time steps during the model prediction, which clearly reveals the response structure of the model to the spatial visual area. From the heat map distribution, it can be observed that the model allocates significant attention to stars, highlighted vortex structures, central spires, and dark foreground contour regions in the image. Rows correspond to Late Fusion, Multimodal Transformer, and BioArt-Net models. The BioArt-Net overlays demonstrated consistent concentration on semantically and affectively salient regions such as the spiral vortex and star clusters. By contrast, the Late Fusion model generated dispersed and irregular activation patterns with no consistent localization, while the Transformer baseline showed moderate focus but less structural regularity. Heatmaps were generated using the interpolated attention matrix and overlayed via linear-weighted fusion with the original artwork image. All images were resized to 224 × 224 pixels, and watermark artifacts were masked prior to rendering.

Figure 6.

Visual attention comparison.

These results collectively validate that the full BioArt-Net model exhibits more temporally aligned representation dynamics and more concentrated spatial attention distributions compared to both unimodal and alternative multimodal architectures.

The attention map thresholded at 80th percentile was compared to unioned human masks. The average IoU score was 0.64 (SD = 0.07), with inter-rater agreement at 0.72. These results confirm that BioArt-Net reliably highlights regions judged important by experts.

Conclusion and future work

This work presents BioArt-Net, a deep fusion framework that integrates spatial visual features and temporal physiological signals to model viewer-specific emotional responses during art perception. The architecture introduces cross-modal attention to bridge visual tokens and biosignal embeddings and employs auxiliary loss terms to reinforce alignment and regularization. Evaluation on the BioArt-Emotion dataset shows that the proposed model achieves notable improvements in classification accuracy, macro F1 score, and interpretability compared to both unimodal and alternative multimodal designs. Attention visualizations and temporal trajectory analyses reveal that the model selectively attends to emotionally salient visual regions and exhibits trajectory shifts aligned with physiological arousal.

However, the BioArt-Emotion dataset primarily features Western canonical artworks, which may limit cross-cultural generalizability. Additionally, the relatively small participant pool constrains the robustness of individual-level modeling. Future work will incorporate more culturally diverse art collections and a larger, demographically varied subject base to improve generalization.

Footnotes

Acknowledgments

I thank the anonymous reviewers whose comments and suggestions helped to improve the manuscript.

ORCID iD

Xinqiao Hu

Author contributions

Xinqiao Hu conceived and designed the study, collected and analyzed the data, drafted the manuscript, and revised it critically for important intellectual content. The author has read and approved the final manuscript.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

References

Wang

Zhao

, et al. Unlocking the emotional world of visual media: an overview of the science, research, and impact of understanding emotion. Proc IEEE 2023; 111(10): 1236–1286.

Green

. Affective expression in the visual arts. In: The expression of emotion in the visual arts. Routledge, 2024, pp. 46–66.

Chowdhury

Liu

Ramanna

. Simple histogram equalization technique improves performance of VGG models on facial emotion recognition datasets. Algorithms 2024; 17(6): 238.

Ronickom

JFA

. Enhancing emotion recognition: machine learning with phasic spectrogram texture features. 2023 IEEE 5th International Conference on Cybernetics, Cognition and Machine Learning Applications (ICCCMLA). Hamburg, Germany: IEEE, 2023, pp. 600–603.

Yamamoto

Takemura

. EMMA: emotion mixing algorithm for compound expression recognition using angle-based metric learning. In Proceedings of the Asian Conference on Computer Vision (ACCV 2024) Workshops. Hanoi, Vietnam: Springer, 2024, pp. 495–510.

Chen

. Classification of artistic styles of Chinese art paintings based on the CNN model. Comput Intell Neurosci 2022; 2022(1): 4520913.

Shi

. An image classification approach for painting using improved convolutional neural algorithm. Soft Comput 2024; 28(1): 847–873.

Gamage

Kalansooriya

Sandamali

ERC

. An emotion classification model for driver emotion recognition using electroencephalography (EEG). 2022 international research conference on Smart Computing and Systems Engineering (SCSE). Colombo, Sri Lanka: IEEE, 2022, 5, pp. 76–82.

Ismail

SNMS

Aziz

NAA

Ibrahim

. A comparison of emotion recognition system using electrocardiogram (ECG) and photoplethysmogram (PPG). J King Saud Univ-Comput Inf Sci 2022; 34(6): 3539–3558.

10.

Patil

Pawar

Randive

, et al. From face detection to emotion recognition on the framework of raspberry pi and galvanic skin response sensor for visual and physiological biosignals. J Elect Syst Inf Technol 2023; 10(1): 24.

11.

Bansal

Goyal

Choudhary

. A comparative analysis of K-nearest neighbor, genetic, support vector machine, decision tree, and long short term memory algorithms in machine learning. Decision Analytics J 2022; 3: 100071.

12.

Bouazizi

benmohamed

Ltifi

. Decision-making based on an improved visual analytics approach for emotion prediction. Intell Decis Technol 2023; 17(2): 557–576.

13.

Hwooi

SKW

Othmani

Sabri

AQM

. Deep learning-based approach for continuous affect prediction from facial expression images in valence-arousal space. IEEE Access 2022; 10: 96053–96065.

14.

Joudeh

Cretu

Bouchard

. Predicting the arousal and valence values of emotional states using learned, predesigned, and deep visual features. Sensors 2024; 24(13): 4398.

15.

Yang

Chen

. Art appreciation model design based on improved PageRank and ECA-ResNeXt50 algorithm. PeerJ Comput Sci 2023; 9: e1734.

16.

Amrani

Adadi

Berrada

. An attention mechanism-based interpretable model for epileptic seizure detection and localization with self-supervised pre-training. IEEE Access 2025; 16: 60213–60232.

17.

Lin

. Review of studies on emotion recognition and judgment based on physiological signals. Appl Sci 2023; 13(4): 2573.

18.

Ramaswamy

MPA

Palaniswamy

. Multimodal emotion recognition: a comprehensive review, trends, and challenges. WIREs Data Min & Knowl 2024; 14(6): e1563.

19.

Arslan

Akşahin

Yilmaz

, et al. Towards emotionally intelligent virtual environments: classifying emotions through a biosignal-based approach. Appl Sci 2024; 14(19): 8769.

20.

Shi

Zheng

Zhang

, et al. A study of subliminal emotion classification based on entropy features. Front Psychol 2022; 13: 781448.

21.

Almanza-Conejo

Almanza-Ojeda

Contreras-Hernandez

, et al. Emotion recognition in EEG signals using the continuous wavelet transform and CNNs. Neural Comput Appl 2023; 35(2): 1409–1422.

22.

Siam

El-Shafai

Abou Elazm

, et al. Enhanced user verification in IoT applications: a fusion-based multimodal cancelable biometric system with ECG and PPG signals. Neural Comput Appl 2024; 36(12): 6575–6595.

23.

Jiang

Kumar

TBJ

, et al. The role of cognitive factors in consumers’ perceived value and subscription intention of video streaming platforms: a systematic literature review. Cogent Bus Manag 2024; 11(1): 2329247.

24.

Morgado

. Unveiling the power of audio-visual early fusion transformers with dense interactions through masked modeling. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. Seattle, Washington, USA: IEEE, 2024, pp. 27186–27196.

25.

Zhu

Zhang

, et al. Multimodal sentiment analysis based on fusion methods: a survey. Inf Fusion 2023; 95: 306–325.

26.

Ghaith

. Deep context transformer: bridging efficiency and contextual understanding of transformer models. Appl Intell 2024; 54(19): 8902–8923.

27.

Aloi

Bouzit

. Biometric methods for user research: three case studies. In: Extended abstracts of the CHI conference on human factors in computing systems, 2024, pp. 1–8.

28.

Yao

Qian

, et al. High-accuracy classification of multiple distinct human emotions using EEG differential entropy features and ResNet18. Appl Sci 2024; 14(14): 6175.

29.

Helaly

Messaoud

Bouaafia

, et al. DTL-I-ResNet18: facial emotion recognition based on deep transfer learning and improved ResNet18. Signal Image Video Process 2023; 17(6): 2731–2744.

30.

Banik

Kumar

Ganapathy

, et al. Exploring central-peripheral nervous system interaction through multimodal biosignals: a systematic review. IEEE Access 2024.

31.

Qiao

Zhou

, et al. A BiGRU joint optimized attention network for recognition of drilling conditions. Pet Sci 2023; 20(6): 3624–3637.

32.

Cui

, et al. Deep heuristic evolutionary regression model based on the fusion of BiGRU and BiLSTM. Cognit Comput 2023; 15(5): 1672–1686.

33.

Hung

Thu

NHM

. Novelty fused image and text models based on deep neural network and transformer for multimodal sentiment analysis. Multimed Tool Appl 2024; 83(25): 66263–66281.

34.

Wang

, et al. Multimodal transformer with adaptive modality weighting for multimodal sentiment analysis. Neurocomputing 2024; 572: 127181.