Biomechanical analysis of cinematic motion: AI-driven generation and evaluation in film and animation

Abstract

Human motion synthesis plays a central role in film and animation, where motion quality influences both narrative coherence and perceptual realism. While data-driven deep learning models have shown promise in automating motion generation, they often lack biomechanical fidelity, leading to physically implausible results such as limb distortion and foot sliding. To address these challenges, we propose BCMG-Net, a Biomechanically Constrained Motion Generation Network that embeds anatomical and kinetic priors into a Transformer-based architecture. Our model integrates bone length preservation, dynamic smoothness, and energy efficiency constraints directly into the training objective, ensuring structural consistency and motion naturalness. Moreover, semantic control vectors enable context-aware generation for diverse cinematic actions. Experiments conducted on Human3.6M, CMU MoCap, and a curated film motion dataset demonstrate that BCMG-Net outperforms state-of-the-art baselines across multiple biomechanical and perceptual metrics. Joint range heatmaps, center of mass trajectories, and motion embedding analyses further validate the physical coherence of the generated motion. These results establish BCMG-Net as a practical and principled framework for physically grounded motion synthesis in high-fidelity digital storytelling.

Keywords

human motion synthesis; biomechanical constraints; transformer; film animation; motion evaluation; physical plausibility; AI-driven generation

Introduction

Human motion plays a foundational role in visual storytelling, interactive environments, and digital character animation. In film, animation, and immersive media, the expressive power of bodily movement transcends mere kinematics—it encapsulates emotion, intention, personality, and narrative causality. The nuances of posture, timing, symmetry, and rhythm are essential for communicating physical context and character identity.^1,2 Even subtle variations in joint alignment or movement tempo can dramatically affect an audience’s emotional interpretation, making motion quality a critical determinant of visual realism, believability, and narrative engagement in virtual productions. Historically, animators have relied on manual keyframing, a labor-intensive process in which artists define joint poses at specific intervals and interpolate between them to generate motion.^3–5 While this technique affords significant creative control, it is limited in scalability and struggles to reproduce the fine-grained, coordinated dynamics characteristic of natural human movement. Alternatively, motion capture (MoCap) systems have become a standard in high-end production pipelines due to their ability to record precise 3D pose data from human performers.⁶ However, MoCap introduces several drawbacks: it is logistically complex, incurs high financial costs, requires specialized equipment and controlled environments, and often fails to generalize across characters, motion styles, or narrative contexts. Moreover, MoCap data remains constrained by the physical limitations of the recorded performances, limiting its applicability in stylized or exaggerated animation.

In response to these constraints, the research community has turned to data-driven human motion generation, aiming to synthesize motion sequences automatically from visual, textual, or symbolic inputs. Powered by advances in deep learning, these approaches have demonstrated promising results in generating realistic, temporally coherent motion.⁷ In particular, Transformer-based architectures⁸ have emerged as a leading paradigm, owing to their ability to model long-range temporal dependencies and capture global spatio-temporal structure. Models such as MotionBERT,⁹ ActFormer,¹⁰ and MotionDiffuse¹¹ have achieved state-of-the-art results in motion prediction and style-conditioned synthesis, pushing the boundaries of motion realism.

However, despite these advances, a persistent limitation remains: most neural models prioritize positional plausibility over physical validity. That is, while the generated motion may appear visually acceptable at first glance, it often suffers from underlying biomechanical inconsistencies. Common issues include unnatural joint rotations, bone length deformations, foot sliding, and abrupt acceleration spikes—artifacts that may go unnoticed in casual viewing but become immediately apparent in professional or biomechanical contexts.^12,13 These inconsistencies are particularly problematic in high-fidelity animation, clinical simulation, and digital human embodiment, where motion must adhere to both esthetic and anatomical principles. A critical analysis of these limitations reveals that they stem primarily from the absence of explicit physical constraints in the model training process. While certain efforts have integrated heuristics such as foot contact detection, phase-guided timing, or inverse kinematics post-processing, these are typically applied as external corrections rather than embedded within the learning objective. As such, they fail to shape the internal motion representation in a physically meaningful way. Few approaches to date have incorporated differentiable biomechanical priors—such as constraints on bone length constancy, inertial continuity, or energy minimization—into end-to-end learning frameworks. Even fewer evaluate their models using physically grounded criteria such as joint torque plausibility, ground reaction symmetry, or metabolic efficiency proxies. As a result, a persistent gap remains between perceptual realism and biomechanical validity, which continues to limit the deployment of neural motion models in physically critical applications.

To address this, we propose the Biomechanically Constrained Motion Generation Network (BCMG-Net), a novel framework that embeds core biomechanical priors into a deep motion generation architecture. Our method integrates spatio-temporal feature encoding, Transformer-based motion synthesis, and a suite of differentiable constraints that enforce: (i) skeletal structural consistency via bone length preservation, (ii) kinetic smoothness via temporal acceleration regularization, and (iii) energy efficiency via displacement-aware penalties. In addition to physical plausibility, BCMG-Net supports semantic conditioning, enabling controllable motion generation aligned with action labels or scene contexts. This makes our model particularly suitable for film-level animation, interactive character systems, and digital avatar creation, where motion must be both meaningful and biomechanically sound. In summary, our contributions are threefold:

(1) We introduce a novel motion generation framework that integrates biomechanical constraints into a Transformer-based architecture, ensuring physical plausibility without sacrificing flexibility or expressiveness.

(2) We design and incorporate differentiable loss functions for bone length, dynamics, and energy usage, enabling end-to-end learning of structurally valid and kinetically coherent motion.

(3) We conduct extensive quantitative and qualitative evaluation, including perceptual studies and biomechanical visualizations, confirming the practical value and scientific validity of our approach.

Related work

Human motion modeling

Early approaches to human motion modeling were predominantly rule-based, including inverse kinematics, spline interpolation, and parametric models.^14,15 While these methods allowed precise control over local pose transitions, they lacked the capacity to model global temporal dynamics and complex joint coordination, rendering them insufficient for generating naturalistic or long-term motions. With the rise of deep learning, researchers began leveraging sequence models, such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks,¹⁶ and Temporal Convolutional Networks (TCNs), for motion prediction and synthesis.¹⁷ These approaches significantly improved the modeling of temporal dependencies and variability, enabling the generation of smoother and more temporally consistent motion sequences. However, recurrent models often suffer from long-term drift and short-range oversmoothing, especially in complex or expressive motion scenarios.

More recently, Transformer-based architectures have become dominant in the field due to their superior capacity for modeling long-range dependencies and global context. MotionBERT⁹ introduced masked motion modeling and pretraining strategies for generalizable motion representations. ActFormer¹⁰ and MotionDiffuse¹¹ explored conditional synthesis through attention-based modules and diffusion mechanisms, respectively. Despite their advances in positional accuracy and temporal structure, these models often prioritize perceptual coherence while neglecting physical plausibility, leading to artifacts such as unnatural joint velocities, implausible contact dynamics, or bone length distortion. The absence of explicit biomechanical modeling remains a key shortcoming across many of these architectures.

Biomechanical constraints in motion synthesis

Integrating biomechanical knowledge into data-driven motion generation remains an underexplored yet crucial direction. Some early efforts introduced soft constraints or heuristics to improve realism. For example, Wang et al.¹⁸ employed phase-functioned neural networks to stabilize foot-ground interaction, while Zhang et al.¹⁹ introduced contact-aware filtering and trajectory warping. More recent works have begun embedding physical considerations into the learning process itself. For example, Srifi et al.²⁰ proposed combining differentiable physics simulators with diffusion-based motion models to enforce physical realism in generated sequences. Feng et al.²¹ introduced energy regularization and symmetric limb dynamics to stabilize gait cycles. However, these approaches are often task-specific (e.g., limited to locomotion) and not easily generalizable to diverse or stylized motion. Furthermore, many rely on modular post-processing or external physical engines rather than fully differentiable constraints embedded within the learning pipeline. Another recent example is PoseCrafter,²² which introduces kinematic consistency terms for pose estimation but does not generalize to generative models. Similarly, Penichet²³ integrates biomechanical losses but focuses primarily on rehabilitation and lacks the perceptual flexibility required in animation domains.

In contrast, our work proposes a unified framework that integrates differentiable, end-to-end biomechanical constraints—including bone length preservation, acceleration smoothness, and energy efficiency—directly into the loss function. This allows the model to implicitly learn physically coherent priors without reliance on post-hoc correction or domain-specific handcrafting.

Semantic control and stylization

Controllability is a critical requirement for practical deployment of motion synthesis systems, especially in interactive media and film animation.²⁴ Numerous studies have focused on action-conditioned motion generation, where symbolic labels (e.g., “walk,” “run,” and “jump”) or continuous latent codes are used to guide output generation. Action and Style Motion^25,26 allow for control over motion style or action category via learned embeddings. Such models increase diversity and coherence of motion generation, yet they often sacrifice physical grounding in favor of expressive flexibility. Recent trends in language-driven motion generation further amplify the expressive range. PromptGen²⁷ and MotionGPT²⁸ introduce large language models to parse human instructions into motion sequences, achieving impressive semantic alignment. However, these models lack mechanisms to enforce anatomical plausibility or biomechanical fidelity, and may generate stylistically compelling yet physically implausible outputs.

Our framework bridges this gap by combining semantic control with biomechanical constraints. Specifically, we condition Transformer-based generation on discrete action labels while enforcing structure-aware losses throughout the training process. This enables context-aware generation that is both controllable and biomechanically grounded—an essential quality for applications in virtual production, film animation, and physical simulation.

Methods

Overall framework

We propose a novel model called Biomechanically Constrained Motion Generation Network (BCMG-Net). This model aims to synthesize naturalistic and biomechanically valid motion sequences by combining spatial-temporal feature modeling with explicit biomechanical constraints, thereby bridging data-driven learning with physical plausibility. The overall architecture of BCMG-Net is illustrated in Figure 1, which demonstrates the sequential data flow from raw video frames to motion generation under biomechanical constraints.

Figure 1.

The overall framework of the proposed BCMG-Net.

The model consists of five integrated modules: (1) data preprocessing and pseudo-3D skeleton reconstruction, (2) spatio-temporal feature encoding using ResNet and Bi-LSTM, (3) biomechanical constraint embedding, (4) semantic-controlled motion generation via Transformer, and (5) visualization and physical evaluation. This design ensures the generated motion is not only realistic but biomechanically plausible.

Data input and pose modeling

The expression of character movements in film and television works is characterized by complexity and diversity. To achieve accurate modeling and generation of these movements, the input data fed into the model must fully and precisely capture the spatial structure and temporal features of human motion. Given that existing audiovisual datasets often lack standardized 3D annotation information, this study designs a preprocessing workflow to extract and reconstruct skeletal motion data from multisource inputs. A series of normalization techniques are applied to establish a unified representation of human posture.

This research utilizes two types of data as the foundation for model training and evaluation. The first type consists of structured motion capture datasets (such as Human3.6M and the CMU MoCap dataset), which provide precise 3D skeletal coordinates of the human body and serve as a solid basis for physical modeling and supervised learning. The second type includes unstructured real-world video data from film and television, which are transformed into a model-recognizable format via keypoint extraction and pose estimation techniques, thereby enhancing the model’s ability to adapt to complex motion scenarios in real environments.

During the keypoint extraction stage, this study incorporates pose estimation models based on deep neural networks—OpenPose and AlphaPose^29,30—to extract 2D skeletal coordinates frame-by-frame from input video sequences.

As shown in Figure 2, the model transforms the 2D keypoint sequence $P_{t}$ extracted from raw frames into a normalized pseudo-3D skeleton $X_{t}$ . Orange nodes denote 2D joint positions on the image plane, while blue nodes represent their reconstructed spatial locations after temporal and anatomical refinement.

Figure 2.

Transformation pipeline from 2D keypoints to pseudo-3D skeleton.

The pseudo-3D reconstruction estimates depth via monocular lifting, leveraging temporal priors and anatomical constraints, following methods similar to those in prior work.^29,30 Each frame image is transformed into a set of 2D human keypoints:

P_{t} = \{(x_{i}^{t}, y_{i}^{t})\}_{i = 1}^{N}

(1)

Here, $N$ denotes the number of joints in the skeleton, and $(x_{i}^{t}, y_{i}^{t})$ represents the coordinate position of the $i$ -th keypoint on the image plane in frame $t$ . While 2D keypoints can describe the basic pose structure of a character within the plane, they are insufficient to capture depth variations and spatial orientation of movements.

To address this limitation, this study further proposes a pseudo-3D reconstruction mechanism. In the absence of high-precision 3D annotations, this mechanism estimates the relative spatial position of each keypoint by leveraging the temporal continuity of keypoints across video frames and incorporating anatomical prior relationships. The resulting 3D skeletal sequence is represented as:

X_{t} = \{(x_{i}^{t}, y_{i}^{t}, z_{i}^{t})\}_{i = 1}^{N}

(2)

Here, $z_{i}^{t}$ denotes the estimated depth value of the $i$-th keypoint in frame $t$. This representation preserves the 3D geometric characteristics of human motion, thereby enabling the application of physical constraint modeling in subsequent stages.

After completing the 3D pose modeling, the skeletal sequences undergo a normalization process to enhance data consistency and modeling stability. First, the skeleton center of each frame—typically the hip joint—is translated to the origin of the coordinate system, eliminating displacement noise and the effects of camera translation. Second, all keypoint coordinates are uniformly scaled so that the overall skeleton length is normalized to a unit interval:

{\tilde{x}}_{i}^{t} = \frac{x_{i}^{t} - x_{root}^{t}}{l_{\max}}

(3)

Here, $x_{root}^{t}$ represents the reference point in the current frame (e.g., the pelvis center), and $l_{\max}$ is the predefined maximum skeleton length. This normalization process ensures that the input data remain comparable across different individuals and scenes, preventing the model from being overly sensitive to scale variations.
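The root-centering and scaling of Eq. (3) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' preprocessing code: the function name `normalize_skeleton`, the `root_index` argument, and the fallback used when `l_max` is not supplied are assumptions made for the example.

```python
import numpy as np

def normalize_skeleton(X, root_index=0, l_max=None):
    """Root-center and scale a pose sequence X of shape (T, N, 3), as in Eq. (3)."""
    X = np.asarray(X, dtype=float)
    # Translate every frame so that the root joint sits at the origin.
    centered = X - X[:, root_index:root_index + 1, :]
    if l_max is None:
        # Assumption: fall back to the largest root-to-joint distance in the
        # clip as a stand-in for the predefined maximum skeleton length.
        l_max = np.linalg.norm(centered, axis=-1).max()
    return centered / l_max

# Toy clip with 2 frames and 3 joints (joint 0 plays the role of the pelvis).
X = np.array([
    [[1.0, 1.0, 0.0], [1.0, 3.0, 0.0], [1.0, 1.0, 2.0]],
    [[2.0, 0.0, 0.0], [2.0, 2.0, 0.0], [4.0, 0.0, 0.0]],
])
X_norm = normalize_skeleton(X, root_index=0, l_max=2.0)
```

After this step the root joint is at the origin in every frame, so global translation of the performer or camera no longer affects the coordinates.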

To further mitigate frame-to-frame jitter and estimation errors, this study applies the Savitzky–Golay filtering algorithm to perform temporal smoothing on the keypoint sequences. This algorithm reconstructs the trajectory of each keypoint by fitting a polynomial curve within a local temporal window, thereby preserving the motion rhythm and overall trends while effectively suppressing high-frequency noise and short-term fluctuations.
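The local polynomial fit behind Savitzky–Golay smoothing can be illustrated with NumPy alone. The window length, polynomial order, and edge padding below are assumptions, since the text does not report the settings used; where SciPy is available, `scipy.signal.savgol_filter(X, window_length, polyorder, axis=0)` provides an equivalent filter with more boundary options.

```python
import numpy as np

def savgol_smooth(X, window=5, polyorder=2):
    """Smooth X of shape (T, N, 3) along the time axis with a Savitzky-Golay filter."""
    X = np.asarray(X, dtype=float)
    half = window // 2
    offsets = np.arange(-half, half + 1)
    # Least-squares polynomial fit over each window: row 0 of the pseudo-inverse
    # holds the weights that evaluate the fitted polynomial at the window center.
    A = np.vander(offsets, polyorder + 1, increasing=True)
    weights = np.linalg.pinv(A)[0]
    # Assumption: handle the sequence boundaries by repeating the edge frames.
    padded = np.pad(X, ((half, half), (0, 0), (0, 0)), mode="edge")
    out = np.zeros_like(X)
    for k, w in enumerate(weights):
        out += w * padded[k:k + X.shape[0]]
    return out
```

Because the fit is polynomial, any trajectory that is itself a polynomial of degree at most `polyorder` passes through unchanged away from the boundaries, which is why the filter preserves motion rhythm better than a plain moving average.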

After the above preprocessing steps, all pose data are uniformly encoded in the form of a 3D tensor:

X = [X_{1}, X_{2}, \dots, X_{T}] \in R^{T \times N \times 3}

(4)

Here, $T$ denotes the number of frames in the motion sequence, $N$ represents the number of keypoints, and 3 corresponds to the three spatial coordinate axes. This data format is well-suited for frame-level modeling using convolutional neural networks, while also facilitating temporal networks in capturing dynamic features across frames.

Motion representation and constraint design

Human motion arises from both spatial posture and temporal evolution. Therefore, effective modeling must capture intra-frame skeletal structure and inter-frame dynamics. In BCMG-Net, we introduce a joint spatio-temporal encoding module that combines a convolutional encoder with a sequence modeling network.

For spatial encoding, we employ ResNet-50,³¹ a residual convolutional network, to process each input frame’s pose image or heatmap $I_{t}$, extracting local structural features of joint arrangements:

f_{t}^{(s)} = ResNet (I_{t})

(5)

Here, $f_{t}^{(s)} \in R^{d}$ is the spatial embedding of frame $t$. To capture the motion dynamics, we use a Bidirectional Long Short-Term Memory (Bi-LSTM) network to encode forward and backward motion context:

F_{seq} = BiLSTM (f_{1}^{(s)}, f_{2}^{(s)}, \dots, f_{T}^{(s)})

(6)

Here, $F_{seq} \in R^{T \times h}$ encodes both structural and rhythmic features. This joint representation allows the model to learn not only static poses but also temporal regularities such as transitions and action rhythm. The Bi-LSTM³² is constructed with two stacked layers and dropout regularization to enhance generalization and prevent overfitting. This module transforms the raw pose data into a high-dimensional, context-aware representation suitable for both biomechanical reasoning and conditional synthesis.

Biomechanical constraint modeling

In motion generation tasks driven by deep learning, merely relying on network approximation often results in motion sequences that appear “visually acceptable” but deviate from the biomechanical laws that govern actual human movement. This can lead to unrealistic joint distortions, temporal discontinuities, or physically implausible motion transitions. To address these issues, BCMG-Net introduces a series of explicit biomechanical constraints during training to enforce physical plausibility.

These constraints are designed from three perspectives: (1) skeletal structural consistency, (2) dynamic continuity, and (3) energy-efficient motion planning. Together, they form a comprehensive biomechanical framework to ensure the naturalness, stability, and efficiency of generated motions.

Skeletal consistency constraint

In human physiology, the distances between adjacent joints (bone lengths) remain constant during motion. To maintain anatomical correctness, we introduce a skeletal consistency loss that penalizes deviations in bone lengths from their reference values:

L_{bone} = \sum_{t = 1}^{T} \sum_{(i, j) \in E} {({∥ x_{i}^{t} - x_{j}^{t} ∥}_{2} - l_{i j}^{0})}^{2}

(7)

Here, $E$ represents the set of connected joint pairs, and $l_{i j}^{0}$ is the average bone length computed from training data. This constraint ensures that the synthesized skeleton maintains proportional integrity and avoids biologically implausible deformation such as “bone stretching” or “collapsing.”
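As a minimal sketch of Eq. (7), the bone length penalty can be evaluated as follows. NumPy is used here only to make the arithmetic explicit; in training the same expression would be written with differentiable tensor operations. The function name, the edge list, and the toy skeleton are assumptions for illustration.

```python
import numpy as np

def bone_length_loss(X, edges, ref_lengths):
    """Eq. (7): squared deviation of each bone length from its reference value.

    X: (T, N, 3) joint positions; edges: list of (i, j) joint index pairs;
    ref_lengths: reference length for each edge.
    """
    X = np.asarray(X, dtype=float)
    i, j = np.asarray(edges).T
    lengths = np.linalg.norm(X[:, i] - X[:, j], axis=-1)  # shape (T, |E|)
    return float(np.sum((lengths - np.asarray(ref_lengths, dtype=float)) ** 2))

# Hypothetical 3-joint chain with two unit-length bones.
edges = [(0, 1), (1, 2)]
ref_lengths = [1.0, 1.0]
X = np.array([
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]],  # both bones have length 1
    [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [2.0, 1.0, 0.0]],  # first bone stretched to 2
])
loss = bone_length_loss(X, edges, ref_lengths)
```

Only the stretched bone in the second frame contributes, so the penalty is zero for any rigid rotation or translation of a correctly proportioned skeleton.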

Dynamic smoothness constraint

Human motion must comply with inertia and acceleration continuity. To enforce temporal consistency, we introduce a dynamic smoothness loss based on the second-order difference across time frames, which serves as an approximation of acceleration:

L_{dyn} = \sum_{t = 2}^{T - 1} \sum_{i = 1}^{N} {∥ x_{i}^{t + 1} - 2 x_{i}^{t} + x_{i}^{t - 1} ∥}_{2}^{2}

(8)

This loss penalizes abrupt velocity changes, effectively promoting smooth transitions and preventing jerky or unnatural motion, especially in high-frequency body parts such as arms and ankles.
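The second-order difference in Eq. (8) can be sketched directly with array slicing. The function name is an assumption, and the toy trajectory is chosen only to show which frames contribute to the penalty.

```python
import numpy as np

def dynamic_smoothness_loss(X):
    """Eq. (8): sum of squared second-order temporal differences of X (T, N, 3)."""
    X = np.asarray(X, dtype=float)
    # x^{t+1} - 2 x^{t} + x^{t-1} for every interior frame t.
    accel = X[2:] - 2.0 * X[1:-1] + X[:-2]  # shape (T-2, N, 3)
    return float(np.sum(accel ** 2))

# One joint moving along x: constant velocity for three frames, then a jump.
X = np.zeros((4, 1, 3))
X[:, 0, 0] = [0.0, 1.0, 2.0, 4.0]
loss = dynamic_smoothness_loss(X)
```

Constant-velocity segments produce a zero second difference, so the loss is driven entirely by frames where the velocity changes.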

Energy efficiency constraint

From the perspective of biomechanics and neuromotor control, the human body aims to minimize unnecessary energy expenditure during movement. To reflect this principle, we propose an energy-based constraint that encourages motion sequences with minimal displacement redundancy:

L_{energy} = \sum_{t = 1}^{T - 1} \sum_{i = 1}^{N} {∥ x_{i}^{t + 1} - x_{i}^{t} ∥}_{2}

(9)

This term penalizes excessive or redundant joint movement and promotes coherent and efficient motion trajectories.
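Eq. (9) sums unsquared frame-to-frame displacements, which makes it the total path length traveled by all joints. A minimal NumPy sketch, with an assumed function name and a toy trajectory:

```python
import numpy as np

def energy_loss(X):
    """Eq. (9): total path length traveled by all joints in X (T, N, 3)."""
    X = np.asarray(X, dtype=float)
    steps = np.linalg.norm(X[1:] - X[:-1], axis=-1)  # shape (T-1, N)
    return float(steps.sum())

# One joint that moves by (3, 4, 0) and then stays still.
X = np.array([
    [[0.0, 0.0, 0.0]],
    [[3.0, 4.0, 0.0]],
    [[3.0, 4.0, 0.0]],
])
loss = energy_loss(X)
```

Because the norm is not squared, a detour always costs more than the direct path between the same endpoints, which is what discourages redundant joint movement.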

Total loss function

The final training objective combines the basic reconstruction loss with the above biomechanical constraints, yielding the total loss function:

L_{total} = L_{recon} + λ_{1} L_{bone} + λ_{2} L_{dyn} + λ_{3} L_{energy}

(10)

Here, $L_{recon}$ measures the mean squared error (MSE) between generated and ground truth joint positions, while $λ_{1}, λ_{2}, λ_{3}$ are hyperparameters controlling the influence of each biomechanical constraint. This composite loss guides the network to generate motions that are not only data-driven but also physiologically and physically valid. We empirically set $λ_{1} = 0.5$ (bone length), $λ_{2} = 0.3$ (dynamics), and $λ_{3} = 0.2$ (energy) based on validation performance. A sensitivity analysis showed stable results when each coefficient was varied within ±0.1.
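The weighted combination in Eq. (10) and a possible layout for the sensitivity sweep can be sketched as below. The default weights are the reported values; the one-coefficient-at-a-time sweep is an assumption, since the text does not describe how the ±0.1 analysis was organized, and both function names are hypothetical.

```python
def total_loss(l_recon, l_bone, l_dyn, l_energy, weights=(0.5, 0.3, 0.2)):
    """Eq. (10): reconstruction loss plus weighted biomechanical terms."""
    lam_bone, lam_dyn, lam_energy = weights
    return l_recon + lam_bone * l_bone + lam_dyn * l_dyn + lam_energy * l_energy

def sensitivity_grid(weights=(0.5, 0.3, 0.2), delta=0.1):
    """Assumed sweep: perturb one coefficient at a time by +/- delta."""
    grid = []
    for k in range(len(weights)):
        for sign in (-1, 1):
            w = list(weights)
            w[k] = round(w[k] + sign * delta, 10)
            grid.append(tuple(w))
    return grid

example = total_loss(1.0, 2.0, 1.0, 5.0)  # 1.0 + 0.5*2.0 + 0.3*1.0 + 0.2*5.0
configs = sensitivity_grid()
```

Each configuration in `configs` would correspond to one retraining run whose validation metrics are compared against the default setting.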

Conditional motion generation and semantic control

In real-world animation production, characters are often required to perform actions with clearly defined semantics, such as “wave goodbye,” “run forward,” or “stand up from sitting.” While these actions may vary in appearance, they are semantically consistent. Thus, enabling the model to generate motion under semantic control improves both usability and practical flexibility.

To this end, BCMG-Net integrates a conditional motion generation module based on semantic control vectors. This module takes a label embedding vector $c \in R^{d}$ , representing an action instruction, and fuses it with the extracted spatio-temporal feature sequence $F_{seq}$ to generate a motion sequence:

\hat{X} = G (F_{seq}, c)

(11)

Here, $G (\cdot)$ represents a Transformer-based generator³³ that uses multi-head self-attention to learn long-range temporal dependencies while adhering to the guidance of $c$ . The semantic vector is repeated across the temporal axis and injected into the decoding layers to ensure frame-wise condition awareness.
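Repeating the semantic vector across the temporal axis can be illustrated with a small shape-level sketch. Concatenation along the feature axis is an assumption made here for concreteness; the text states only that the repeated vector is injected into the decoding layers, not how it is fused.

```python
import numpy as np

def broadcast_condition(F_seq, c):
    """Tile the label embedding c (d,) over T frames and append it to F_seq (T, h)."""
    F_seq = np.asarray(F_seq, dtype=float)
    c = np.asarray(c, dtype=float)
    c_tiled = np.tile(c, (F_seq.shape[0], 1))  # shape (T, d)
    # Assumption: fuse by concatenation, giving one (h + d)-dim vector per frame.
    return np.concatenate([F_seq, c_tiled], axis=-1)

F_seq = np.zeros((4, 8))       # T = 4 frames, h = 8 features
c = np.array([1.0, 0.0, 2.0])  # d = 3 label embedding
fused = broadcast_condition(F_seq, c)
```

Every frame then carries an identical copy of the condition, which is what gives the generator frame-wise awareness of the requested action.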

During training, the generator is optimized using the following reconstruction loss:

L_{recon} = \frac{1}{T N} \sum_{t = 1}^{T} \sum_{i = 1}^{N} {∥ {\hat{x}}_{i}^{t} - x_{i}^{t} ∥}_{2}^{2}

(12)

Here, ${\hat{x}}_{i}^{t}$ is the predicted position of joint $i$ at frame $t$, and $x_{i}^{t}$ is the corresponding ground truth.
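A minimal NumPy sketch of Eq. (12) follows; the function name is an assumption. Note that the equation averages the squared Euclidean norm over frames and joints, so its value is three times a per-coordinate mean squared error taken over all elements.

```python
import numpy as np

def reconstruction_loss(X_hat, X):
    """Eq. (12): mean over frames and joints of the squared joint position error."""
    diff = np.asarray(X_hat, dtype=float) - np.asarray(X, dtype=float)
    T, N = diff.shape[:2]
    return float(np.sum(diff ** 2) / (T * N))

X = np.zeros((2, 2, 3))
X_hat = np.ones((2, 2, 3))  # every joint is off by (1, 1, 1)
loss = reconstruction_loss(X_hat, X)
```

The factor of three matters when comparing against library losses that average over every coordinate, because it changes the effective balance between the reconstruction term and the constraint weights.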

This conditional generation architecture allows users to control motion generation based on simple symbolic labels, making the system compatible with text-guided animation tools or interactive authoring platforms. Furthermore, this mechanism supports domain adaptation and style transfer by conditioning on different motion “styles” such as “normal walking,” “exaggerated walking,” or “robotic movement.” In this context, “exaggerated” refers to stylistic motion derived from cartoon physics, high-speed stunt sequences, or genre-specific exaggeration such as superhero landings or animated squash-and-stretch effects. “Structural fidelity” relates to preserving joint range limits and balanced center of mass trajectories.

Motion visualization and biomechanical evaluation system

Generated motion sequences must be rigorously validated before being used in professional animation pipelines. Beyond numerical reconstruction accuracy, it is necessary to analyze their biomechanical plausibility, visual continuity, and structural fidelity. To this end, BCMG-Net incorporates a comprehensive motion visualization and evaluation system that combines 3D reconstruction, physical metric computation, and perceptual assessment.

3D motion reconstruction in Unity

To support intuitive assessment of generated motion, we developed a real-time 3D visualization environment using Unity3D. The predicted motion sequence $\hat{X} \in R^{T \times N \times 3}$ is mapped onto a virtual character skeleton. Each joint in the output is bound to a corresponding bone in the avatar rig, allowing frame-by-frame animation. The system supports real-time rendering, camera rotation, and playback speed adjustment, enabling users to observe the motion from multiple perspectives. This makes it possible to identify physical anomalies such as floating joints, unnatural gait, or sudden velocity shifts.

Biomechanical metric evaluation

To complement visual inspection, we define several objective metrics to evaluate physical realism:

Skeletal consistency error: variance in bone lengths across frames.

Mean acceleration:

a_{mean} = \frac{1}{(T - 2) N} \sum_{t = 2}^{T - 1} \sum_{i = 1}^{N} {∥ {\hat{x}}_{i}^{t + 1} - 2 {\hat{x}}_{i}^{t} + {\hat{x}}_{i}^{t - 1} ∥}_{2}

(13)

Joint range of motion: spatial displacement range per joint.

Total motion energy:

E_{total} = \sum_{t = 1}^{T - 1} \sum_{i = 1}^{N} {∥ {\hat{x}}_{i}^{t + 1} - {\hat{x}}_{i}^{t} ∥}_{2}

(14)

These metrics are computed across test sequences and compared with ground truth MoCap data to quantitatively assess the biomechanical plausibility and stability of the generated motion.
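The four metrics can be sketched together in NumPy. Two details are assumptions, because the text does not fix them: the bone length variance is averaged over bones, and the range of motion is reported as the per-axis extent of each joint. All function names are hypothetical.

```python
import numpy as np

def bone_length_variance(X, edges):
    """Variance of each bone length over time, averaged over bones (assumed)."""
    i, j = np.asarray(edges).T
    lengths = np.linalg.norm(X[:, i] - X[:, j], axis=-1)  # shape (T, |E|)
    return float(lengths.var(axis=0).mean())

def mean_acceleration(X):
    """Eq. (13): mean norm of the second-order temporal difference."""
    accel = X[2:] - 2.0 * X[1:-1] + X[:-2]
    return float(np.linalg.norm(accel, axis=-1).mean())

def joint_range_of_motion(X):
    """Per-joint, per-axis extent over the sequence (assumed definition)."""
    return X.max(axis=0) - X.min(axis=0)  # shape (N, 3)

def total_motion_energy(X):
    """Eq. (14): summed frame-to-frame displacement of all joints."""
    return float(np.linalg.norm(X[1:] - X[:-1], axis=-1).sum())

# Two joints joined by a rigid unit bone, translating along x.
X = np.zeros((4, 2, 3))
X[:, :, 0] = np.array([0.0, 1.0, 2.0, 4.0])[:, None]
X[:, 1, 1] = 1.0
edges = [(0, 1)]
```

Computing the same functions on a generated sequence and on the matching MoCap sequence gives the paired values needed for the comparison described above.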

To supplement quantitative results, we conducted a human evaluation involving professional animators and biomechanics researchers. Participants rated each generated sequence on a 5-point Likert scale along three dimensions: motion naturalness, rhythmic coordination, and structural plausibility. Scores are aggregated to compute mean subjective ratings.

Experiments and results

Experimental setup

All experiments were conducted on a machine equipped with an NVIDIA RTX 3090 GPU, 128 GB RAM, and an AMD Ryzen 9 5950X processor. The model was implemented using PyTorch 1.13 and trained using the Adam optimizer with an initial learning rate of 0.0001 and a batch size of 32. Training was carried out for 100 epochs with early stopping based on validation loss.

Datasets

We use two publicly available motion datasets and one in-house annotated film dataset for training and evaluation:

• Human3.6M: A large-scale motion capture dataset containing over 3.6 million 3D human poses across 15 activities performed by 11 subjects. Each motion sequence is sampled at 50 Hz and includes 32 joint positions. After preprocessing, sequences are resampled to 30 FPS and 22 joints are retained.

• CMU MoCap: A diverse dataset with over 2600 motion sequences covering walking, running, dancing, and sports. It is used primarily to test generalization across activity types.

• FilmPose-10 (internal): A curated collection of 10 high-resolution video segments from commercial films, containing exaggerated or stylized cinematic motion. Each sequence is manually annotated and pseudo-3D reconstructed to match the same 22-joint skeleton structure.

All data are standardized through normalization and smoothing as described in Section 3.2 (“Data input and pose modeling”). For FilmPose-10, the 10 clips were selected based on their diversity in genre (e.g., action, drama, and fantasy), motion style (e.g., exaggerated and realistic), and frame density. Manual annotation was conducted by two experienced animators to ensure skeletal consistency. Table 1 summarizes the clip metadata, including clip length, resolution, and action labels.

Table 1.

Metadata of FilmPose-10 dataset.

Clip id	Genre	Motion style	Length (Frames)	Resolution (px)	Action label
FP01	Action	Exaggerated stunt	1500	1920 × 1080	Jump and roll
FP02	Drama	Realistic everyday	1200	1920 × 1080	Slow walk
FP03	Fantasy	Stylized combat	1800	1920 × 1080	Sword swing
FP04	Thriller	Tense stealth	1400	1920 × 1080	Crouch and sneak
FP05	Comedy	Exaggerated cartoon	1600	1920 × 1080	Slapstick fall
FP06	Action	Parkour movement	1700	1920 × 1080	Wall climb and vault
FP07	Sci-Fi	Robotic motion	1300	1920 × 1080	Mechanical arm movement
FP08	Fantasy	Stylized dance	1500	1920 × 1080	Whirling spin
FP09	Horror	Jerky, unnatural	1250	1920 × 1080	Disjointed crawling
FP10	Romance	Gentle interaction	1100	1920 × 1080	Hand-holding and embrace

Evaluation metrics

To comprehensively assess model performance, we employ a combination of structural, kinematic, and perceptual evaluation metrics. These include: (1) Reconstruction Error (RE), which measures the mean Euclidean distance between predicted and ground truth joint positions; (2) Bone Length Deviation (BLD), which quantifies skeletal stability by computing the standard deviation in bone lengths across frames; (3) Mean Acceleration (MA), calculated from the second-order temporal difference of joint positions to assess motion smoothness; (4) Motion Energy (ME), indicating the total joint displacement within a sequence; and (5) Fréchet Pose Distance (FPD), which evaluates distributional similarity between synthesized and real pose sequences in a learned feature space. FPD is computed using the pretrained MotionBERT encoder as the feature extractor, taking the 1024-dimensional penultimate-layer embeddings of all frames in the sequence. This follows the benchmark protocol of MotionBERT.⁹ Together, these metrics provide a rigorous evaluation of physical plausibility, stability, and visual quality.
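A Fréchet-style distance between two sets of embeddings can be sketched as follows. The closed form below is the Gaussian Fréchet distance familiar from FID; the text does not spell out its FPD formula, so using this form is an assumption. Feature extraction is not shown: the arrays stand in for encoder embeddings, and the function name is hypothetical.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Gaussian Frechet distance between feature sets of shape (n_samples, dim)."""
    feats_real = np.asarray(feats_real, dtype=float)
    feats_gen = np.asarray(feats_gen, dtype=float)
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative for
    # positive semi-definite covariances; clip guards against round-off.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)

# Synthetic stand-in features: the second set is the first shifted by one unit.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
gen = real + np.array([1.0, 0.0, 0.0, 0.0])
```

With identical covariances the covariance terms cancel, so the distance reduces to the squared distance between the two means.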

Model comparison

We compared BCMG-Net against four leading models: HP-GAN,³⁴ a GAN-based model for motion synthesis; ST-GCN,³⁵ a spatio-temporal graph convolutional network that models skeleton motion as graph signals; MotionBERT,⁹ a Transformer-based encoder pretrained for motion representation learning; and ActFormer,¹⁰ a Transformer model with conditioning capabilities for action-aware motion generation.

All baselines were retrained using our preprocessed skeleton format and under identical training settings for fairness; the results are shown in Table 2. Specifically, we re-implemented data preprocessing and loss functions for MotionBERT and ActFormer to align with our 22-joint skeleton convention and loss definitions. This ensures that all models are evaluated under identical skeleton structures and training objectives.

Table 2.

Comparison of BCMG-Net with baseline models.

Model	RE ↓ (mm)	BLD ↓	MA ↓	ME ↓	FPD ↓
HP-GAN	9.84	2.11	2.35	8.52	0.426
ST-GCN	8.13	1.84	2.02	7.98	0.391
MotionBERT	6.42	1.55	1.78	6.85	0.301
ActFormer	6.13	1.51	1.61	6.29	0.287
BCMG-Net	5.21	1.18	1.26	5.74	0.204

BCMG-Net achieves the lowest RE of 5.21 mm, a 15.0% reduction relative to the strongest baseline, ActFormer (6.13 mm), indicating superior spatial accuracy in joint position prediction. The BLD of 1.18 demonstrates its ability to maintain anatomical structure over time, outperforming ST-GCN (1.84) and the Transformer-based ActFormer (1.51). In terms of motion smoothness, BCMG-Net yields the lowest MA (1.26 versus 1.61 for ActFormer), reducing the erratic joint fluctuations observed in other models. Its lower ME (5.74) reflects a more efficient and restrained use of joint trajectories, avoiding the excessive movement common in GAN-based methods such as HP-GAN (8.52). Most notably, the FPD score of 0.204, against 0.287 for ActFormer, shows that BCMG-Net generates motion distributions closest to real human motion, as measured in a learned perceptual embedding space. All reported metrics are averaged over five independent runs, with standard deviations reported in parentheses. Statistical significance between BCMG-Net and baselines was validated using paired t-tests (p < 0.05), confirming the robustness of the improvements. These results confirm that the integration of biomechanical constraints does not merely provide regularization; it directly enhances both physical plausibility and perceptual quality, establishing BCMG-Net as a robust framework for realistic motion generation in complex, high-fidelity scenarios.
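To make the size of these gaps concrete, the short script below computes the relative reduction of each metric against the best baseline in Table 2. The values are copied from the table; the dictionary layout and the helper name `relative_reduction` are illustrative choices, not part of the original evaluation.

```python
# Values copied from Table 2 (lower is better for every metric).
table2 = {
    "HP-GAN":     {"RE": 9.84, "BLD": 2.11, "MA": 2.35, "ME": 8.52, "FPD": 0.426},
    "ST-GCN":     {"RE": 8.13, "BLD": 1.84, "MA": 2.02, "ME": 7.98, "FPD": 0.391},
    "MotionBERT": {"RE": 6.42, "BLD": 1.55, "MA": 1.78, "ME": 6.85, "FPD": 0.301},
    "ActFormer":  {"RE": 6.13, "BLD": 1.51, "MA": 1.61, "ME": 6.29, "FPD": 0.287},
    "BCMG-Net":   {"RE": 5.21, "BLD": 1.18, "MA": 1.26, "ME": 5.74, "FPD": 0.204},
}

def relative_reduction(ours, baseline):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

ours = table2["BCMG-Net"]
baselines = {name: row for name, row in table2.items() if name != "BCMG-Net"}

gains = {}
for metric, value in ours.items():
    # Strongest competitor on this metric is the baseline with the lowest score.
    best_name = min(baselines, key=lambda name: baselines[name][metric])
    best_value = baselines[best_name][metric]
    gains[metric] = (best_name, relative_reduction(value, best_value))

for metric, (best_name, pct) in gains.items():
    print(f"{metric}: {pct:.1f}% lower than {best_name}")
```

On these numbers ActFormer is the strongest baseline on every metric, and the reductions range from 8.7% (ME) to 28.9% (FPD).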

To rigorously assess the individual impact of each biomechanical constraint in the BCMG-Net framework, we conducted a series of ablation experiments in which one constraint term was removed at a time from the training objective. The results, summarized in Table 3, demonstrate that each constraint contributes uniquely to different dimensions of motion quality.

Table 3.

Performance of BCMG-Net under ablations of biomechanical constraints.

Configuration	RE ↓	BLD ↓	MA ↓	ME ↓
Full BCMG-Net	5.21	1.18	1.26	5.74
W/o bone constraint (no $L_{bone}$)	5.83	2.05	1.44	5.78
W/o dynamics constraint (no $L_{dyn}$)	5.67	1.22	1.73	6.04
W/o energy constraint (no $L_{energy}$)	5.61	1.29	1.30	6.62
W/o biomechanical constraints	6.08	2.26	2.02	6.94

Removing the bone length constraint ($L_{bone}$) results in a sharp increase in bone length deviation (BLD rises from 1.18 to 2.05), indicating that structural regularization is essential for maintaining anatomically plausible joint proportions. The absence of the dynamic smoothness constraint ($L_{dyn}$) leads to a notable increase in mean acceleration (MA increases from 1.26 to 1.73), reflecting the emergence of jittery or discontinuous motion trajectories that are visually apparent in fast-moving joints. Excluding the energy constraint ($L_{energy}$) increases total motion energy (ME from 5.74 to 6.62), resulting in exaggerated limb movements and inefficient trajectories, particularly during low-motion segments. When all constraints are removed simultaneously, the model’s performance deteriorates across all metrics, highlighting a clear regression toward structurally unstable, temporally inconsistent, and energetically wasteful motion patterns. These findings confirm that each biomechanical constraint captures a distinct yet complementary aspect of natural movement, and their combined effect is critical to achieving high-fidelity, physically grounded motion synthesis. The constraint architecture also offers modular flexibility, enabling adaptation to different action domains or physical modeling extensions in future work.
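The pairing between each constraint and the metric it protects can be checked directly from Table 3. The sketch below, with values copied from the table and illustrative variable names, computes the relative increase of every metric under each ablation and reports the metric that degrades most.

```python
# Values copied from Table 3 (lower is better for every metric).
full = {"RE": 5.21, "BLD": 1.18, "MA": 1.26, "ME": 5.74}
ablations = {
    "no L_bone":        {"RE": 5.83, "BLD": 2.05, "MA": 1.44, "ME": 5.78},
    "no L_dyn":         {"RE": 5.67, "BLD": 1.22, "MA": 1.73, "ME": 6.04},
    "no L_energy":      {"RE": 5.61, "BLD": 1.29, "MA": 1.30, "ME": 6.62},
    "no constraints":   {"RE": 6.08, "BLD": 2.26, "MA": 2.02, "ME": 6.94},
}

def relative_increase(ablated, reference):
    """Percentage by which `ablated` exceeds `reference`."""
    return 100.0 * (ablated - reference) / reference

most_affected = {}
for name, row in ablations.items():
    deltas = {m: relative_increase(row[m], full[m]) for m in full}
    worst = max(deltas, key=deltas.get)
    most_affected[name] = (worst, deltas[worst])
    summary = ", ".join(f"{m} +{d:.1f}%" for m, d in deltas.items())
    print(f"{name:>15}: {summary}  -> largest: {worst}")
```

Each single-term ablation degrades its own target metric most (BLD by 73.7%, MA by 37.3%, ME by 15.3%), which is consistent with the claim that the three constraints act on largely separate aspects of motion quality.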

Biomechanical consistency and motion structure visualization

To further visualize the differences in motion behavior, we present four complementary evaluations in Figure 3.

Figure 3.

Multi-view visualization of motion quality across models. (a) Joint trajectory over frames; (b) histogram of joint velocities (units: m/s; bin width: 0.05 m/s); (c) t-SNE projection of pose embeddings; (d) foot contact heatmap.

Subfigure (a) compares joint trajectories over time, where BCMG-Net closely aligns with the ground truth and avoids jitter observed in baseline outputs. Subfigure (b) reveals velocity distributions, in which BCMG-Net produces tighter, unimodal curves, while baselines exhibit broader, noisier dynamics, indicating instability. In subfigure (c), we project the pose feature embeddings into a low-dimensional space using a t-SNE-like method; BCMG-Net’s cluster overlaps significantly with the real motion cluster, while baseline points are scattered further afield, demonstrating distributional deviation. Finally, subfigure (d) visualizes a foot contact heatmap on the X-Z plane, where BCMG-Net yields consistent contact regions, confirming stable and grounded locomotion. Collectively, these visualizations reinforce the quantitative results and highlight the advantages of biomechanical regularization in producing coherent, physically valid motion.
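A velocity histogram of the kind shown in subfigure (b) could be produced along the following lines. This is a sketch under assumptions: joint positions are taken to be in metres, the frame rate `fps` is a hypothetical parameter, and only the 0.05 m/s bin width comes from the figure caption.

```python
import numpy as np

def joint_speeds(seq, fps):
    """Per-frame joint speed in m/s from positions in metres, shape (T, J, 3)."""
    return np.linalg.norm(np.diff(seq, axis=0), axis=-1) * fps

def speed_histogram(speeds, bin_width=0.05):
    """Histogram of speeds with fixed-width bins starting at 0 m/s."""
    n_bins = int(np.floor(speeds.max() / bin_width)) + 1
    edges = bin_width * np.arange(n_bins + 1)
    counts, edges = np.histogram(speeds, bins=edges)
    return counts, edges

# Toy sequence: two joints advancing 4 mm per frame at an assumed 30 fps (0.12 m/s).
T, fps = 5, 30
seq = np.zeros((T, 2, 3))
seq[:, :, 0] = 0.004 * np.arange(T)[:, None]
seq[:, 1, 2] = 1.0

counts, edges = speed_histogram(joint_speeds(seq, fps))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:.2f}, {hi:.2f}) m/s: {c}")
```

A tight, unimodal histogram corresponds to counts concentrated in a few adjacent bins, as in this toy case where all eight samples fall in the third bin.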

To further assess the biomechanical plausibility of motion generated by BCMG-Net, we visualize two key indicators in Figure 4: the joint range of motion (ROM) and the trajectory of the center of mass (CoM).

Figure 4.

Biomechanical evaluation of generated motion sequences. (a) Joint range of motion (ROM) heatmap across eight joints and three models (joints labeled as H (hip), K (knee), A (ankle); ROM units: degrees); (b) center of mass (CoM) trajectory on X-Z plane during walking sequence.

Figure 4(a) presents a ROM heatmap comparing Ground Truth, BCMG-Net, and baseline models across eight major joints. It shows that BCMG-Net closely replicates the physiological ROM observed in real human motion, with only minor deviations within acceptable biomechanical thresholds. In contrast, the baseline model often exceeds typical ROM values, especially at the knee and elbow joints, suggesting limb overextension and potential instability during motion synthesis. This indicates that the structural constraint loss $L_{bone}$ in BCMG-Net effectively preserves anatomical joint limits. Figure 4(b) illustrates the CoM trajectory on the X-Z plane over time. The trajectory of BCMG-Net is smooth, periodic, and closely aligned with the ground truth, reflecting well-balanced gait and coordinated weight shifts. Conversely, the baseline model exhibits a noisy and erratic CoM path, indicating poor balance control and inconsistent locomotion rhythm. This validates the contribution of the dynamics and energy constraints $L_{dyn}$ and $L_{energy}$ in stabilizing motion flow. Together, these visualizations confirm that BCMG-Net not only excels in numerical accuracy but also respects essential biomechanical principles, including joint range integrity and balance stability, both critical for realistic, physically coherent character motion.
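The two indicators in Figure 4 can be approximated from joint positions alone, as in the sketch below. The three-point joint angle and the equal-mass CoM are simplifying assumptions made here for illustration; a full biomechanical analysis would use anatomical joint angles and segment masses, and the text does not specify which definitions were used.

```python
import numpy as np

def joint_angle_deg(seq, parent, joint, child):
    """Angle at `joint` between the segments to `parent` and `child`, per frame."""
    u = seq[:, parent] - seq[:, joint]
    v = seq[:, child] - seq[:, joint]
    cos = (u * v).sum(axis=-1) / (np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def range_of_motion(angles_deg):
    """ROM in degrees: spread between the largest and smallest angle in a sequence."""
    return float(angles_deg.max() - angles_deg.min())

def com_xz(seq, weights=None):
    """Per-frame centre of mass projected on the X-Z plane.

    With weights=None every joint has equal mass (an assumption)."""
    w = np.ones(seq.shape[1]) if weights is None else np.asarray(weights, dtype=float)
    com = (seq * w[None, :, None]).sum(axis=1) / w.sum()
    return com[:, [0, 2]]

# Toy leg (hip, knee, ankle): straight in frame 0, knee bent to a right angle in frame 1.
leg = np.array([
    [[0.0, 1.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.5, 0.0], [0.5, 0.5, 0.0]],
])
knee = joint_angle_deg(leg, parent=0, joint=1, child=2)
print("knee angles (deg):", knee)
print("knee ROM (deg):", range_of_motion(knee))
print("CoM on X-Z plane:", com_xz(leg))
```

Comparing `range_of_motion` values against physiological limits per joint gives one cell of a ROM heatmap, while plotting the rows of `com_xz` over a walking sequence gives a trajectory of the kind shown in panel (b).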

Conclusion

This study introduces BCMG-Net, a novel motion generation framework that combines data-driven learning with embedded biomechanical constraints to synthesize physically plausible and semantically controllable motion for film and animation applications. By integrating anatomical regularization (bone length preservation), dynamic continuity (acceleration smoothness), and energetic efficiency into a Transformer-based architecture, BCMG-Net bridges the gap between perceptual realism and biomechanical validity—a limitation in many existing neural motion models.

Our experiments across both structured motion capture datasets and stylized cinematic sequences show that BCMG-Net consistently achieves lower reconstruction error, higher anatomical fidelity, and more stable movement patterns compared to leading baselines. In addition to quantitative benchmarks, we employ a suite of biomechanical visualizations (ROM heatmaps and CoM trajectories) to provide interpretable and multidimensional evidence of the model’s physical coherence.

Looking forward, our approach opens several directions for future research. First, integrating real-time physical simulation or differentiable physics engines may further improve dynamic realism, especially in interactive environments. Second, expanding semantic conditioning to support language-driven or multimodal control could enhance creative flexibility. Finally, adapting the framework for non-human characters or exaggerated stylized motion remains an exciting challenge for cinematic and game animation.

Despite its strengths, BCMG-Net currently depends heavily on the Human3.6M dataset, limiting domain generalization to occluded or multi-person scenarios. Future work will explore robust adaptation to real-world noisy data and multi-character interactions.

Acknowledgments

I thank the anonymous reviewers whose comments and suggestions helped to improve the manuscript.

ORCID iD

Zhaoqi Li

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

References

1. Sekimoto. Rhythmic bodies: sensorial multimodality, entrainment, and intercultural communication. In: Multimodal communication in intercultural interaction. Milton Park: Routledge, 2023, pp. 41–57.

2. D'Armenio. Beyond interactivity and immersion. A kinetic reconceptualization for virtual reality and video games. New Techno Humanities 2022; 2(2): 121–129.

3. Chai. AI-driven knowledge-based motion synthesis algorithms for graphics and animation. New York: State University of New York at Stony Brook, 2024.

4. Studer, Agrawal, Borer, et al. Factorized motion diffusion for precise and character-agnostic motion inbetweening. In: Proceedings of the 17th ACM SIGGRAPH conference on motion, interaction, and games. New York: ACM, 2024, pp. 1–10.

5. Yuanliang, Zhe. Integration effect of artificial intelligence and traditional animation creation technology. J Intell Syst 2024; 33(1): 20230305.

6. Suo, Tang. Motion capture technology in sports scenarios: a survey. Sensors (Peterb, NH) 2024; 24(9): 2947.

7. Zhu, et al. Human motion generation: a survey. IEEE Trans Pattern Anal Mach Intell 2023; 46(4): 2430–2449.

8. Yang, Zhang, Xiao, et al. Efficient data-driven behavior identification based on vision transformers for human activity understanding. Neurocomputing 2023; 530: 104–115.

9. Zhao, Zhuang, et al. LiDAR-based human pose estimation with MotionBERT. In: 2024 IEEE international conference on mechatronics and automation (ICMA). Piscataway: IEEE, 2024, pp. 1849–1854.

10. Song, Wang, et al. Actformer: a GAN-based transformer towards general action-conditioned 3D human motion generation. In: Proceedings of the IEEE/CVF international conference on computer vision. Piscataway: IEEE, 2023, pp. 2228–2238.

11. Zhang, Cai, Pan, et al. Motiondiffuse: text-driven human motion generation with diffusion model. IEEE Trans Pattern Anal Mach Intell 2024; 46(6): 4115–4128.

12. Yin, Wang, Tat, et al. Motion artefact management for soft bioelectronics. Nat Rev Bioeng 2024; 2(7): 541–558.

13. Roupa, da Silva, Marques, et al. On the modeling of biomechanical systems for human movement analysis: a narrative review. Arch Comput Methods Eng 2022; 29(7): 4915–4958.

14. Bracamonte, Saunders, Wilson, et al. Patient-specific inverse modeling of in vivo cardiovascular mechanics with medical image-derived kinematics as input data: concepts, methods, and applications. Appl Sci 2022; 12(8): 3954.

15. Reda, Onsy, Haikal, et al. Path planning algorithms in the autonomous driving system: a comprehensive review. Robot Autonom Syst 2024; 174: 104630.

16. Chen, Chou, Lee, et al. Human motion tracking using 3D image features with a long short-term memory mechanism model: an example of forward reaching. Sensors (Peterb, NH) 2021; 22(1): 292.

17. Andrade-Ambriz, Ledesma, Ibarra-Manzano, et al. Human activity recognition using temporal convolutional neural network architecture. Expert Syst Appl 2022; 191: 116287.

18. Wang, Lei. Research on character action control method based on multi phase-functioned neural network and state machine. J Phys: Conf Ser 2021; 2031(1): 012035.

19. Zhang, Chen, et al. Unified cross-structural motion retargeting for humanoid characters. In: IEEE Transactions on Visualization and Computer Graphics. Piscataway: IEEE, 2024.

20. Serifi, Grandia, Knoop, et al. Robot motion diffusion model: motion generation for robotic characters. In: SIGGRAPH Asia 2024 conference papers. New York: ACM, 2024, pp. 1–9.

21. Feng, Mao, Zhang, et al. Gait-symmetry-based human-in-the-loop optimization for unilateral transtibial amputees with robotic prostheses. IEEE Trans Med Robot Bionics 2022; 4(3): 744–753.

22. Zhong, Zhao, You, et al. PoseCrafter: one-shot personalized video synthesis following flexible pose control. In: European conference on computer vision. Cham: Springer Nature Switzerland, 2024, pp. 243–260.

23. Penichet-Tomas. Applied biomechanics in sports performance, injury prevention, and rehabilitation. Appl Sci 2024; 14(24): 11623.

24. Zhang. Film and television animation production technology based on expression transfer and virtual digital human. Scalable Comput Pract Exp 2024; 25(6): 5560–5567.

25. Wang, et al. ASMNet: action and style-conditioned motion generative network for 3D human motion generation. Cyborg Bionic Syst 2024; 5: 0090.

26. Lamghari, Bilodeau, Saunier. Actar: actor-driven pose embeddings for video action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. Piscataway: IEEE, 2022, pp. 399–408.

27. Zhang, Fei, et al. Promptgen: automatically generate prompts using generative models. In: Findings of the Association for Computational Linguistics. Albuquerque: NAACL, 2022, pp. 30–37.

28. Jiang, Chen, Liu, et al. Motiongpt: human motion as a foreign language. Adv Neural Inf Process Syst 2023; 36: 20067–20079.

29. Huang, Chen, Wang, et al. Fall detection model based on AlphaPose combined with LSTM and LightGBM. In: MIPPR 2023: Pattern Recognition and Computer Vision. California: SPIE, 2024, Vol 13086: 84–91.

30. Mundt, Born, Goldacre, et al. Estimating ground reaction forces from two-dimensional pose data: a biomechanics-based comparison of AlphaPose, BlazePose, and OpenPose. Sensors (Peterb, NH) 2022; 23(1): 78.

31. Lee, Kim, Beak, et al. Real-time pose estimation based on ResNet-50 for rapid safety prevention and accident detection for field workers. Electronics 2023; 12(16): 3513.

32. Liu, Dai, Tang, et al. Bi-LSTM sequence modeling for on-the-fly fine-grained sketch-based image retrieval. IEEE Trans Artif Intell 2022; 4(5): 1178–1185.

33. Transformer-based partner dance motion generation. Eng Appl Artif Intell 2025; 139: 109610.

34. Zhang, Zhou, et al. HpGAN: sequence search with generative adversarial networks. IEEE Trans Neural Netw Learn Syst 2021; 34(8): 4944–4956.

35. Lovanshi, Tiwari. Human skeleton pose and spatio-temporal feature-based activity recognition using ST-GCN. Multimed Tools Appl 2024; 83(5): 12705–12730.