Abstract
Data-driven methods with deep neural networks demonstrate promising results for accurate modeling in soft robots. However, deep neural network models rely on voluminous data to discover the complex and nonlinear representations inherent in soft robots. Consequently, a substantial amount of effort is required for data acquisition, labeling, and annotation, and collecting enough data is not always possible. This article introduces a data-driven learning framework based on synthetic data to circumvent the exhaustive data collection process. More specifically, we propose a novel time series generative adversarial network with a self-attention mechanism, Transformer TimeGAN (TTGAN), to precisely learn the complex dynamics of a soft robot. In addition, the TTGAN is combined with a conditioning network that enables it to produce synthetic data for specific soft robot behaviors. The proposed framework is verified on a widely used pneumatic-based soft gripper as an exemplary experimental setup. Experimental results demonstrate that the TTGAN generates synthetic time series data with realistic soft robot dynamics. Critically, a combination of the synthetic data and only partially available original data produces a data-driven model with estimation accuracy comparable to models obtained using the complete original data.
Introduction
Soft robots aim to transcend traditional robots for use in unstructured or unpredictable environments. 1 This is achieved by designing flexible and compliant robots that are able to conform to their surrounding structure. 2 However, obtaining a kinematic or dynamic model of a soft robot is analytically challenging due to the diverse behaviors, inherent nonlinearities, and theoretically infinite degrees of freedom. 3 Taking these difficulties into account, previous efforts on analytical modeling typically include simplifications and assumptions in their formulations. 4
For example, kinematic models in previous studies5–9 are derived using the piecewise constant curvature simplifications, whereas the dynamic models in previous studies10–15 are obtained via Euler–Lagrangian formulation or Cosserat rod model with quasi-static assumptions. Based on these approximations, the acquired analytical models may not accurately capture the complex nonlinear dynamics of a soft robot and thus do not necessarily guarantee a good estimation performance. 16 A more complex model may solve this problem, but the process of deriving such models is tedious considering that soft robots composed of low Young's modulus materials manifest highly nonlinear characteristics.
To mitigate the difficulties in modeling soft robots analytically, data-driven modeling has been extensively studied and successfully applied for soft robots. For instance, Elgeneidy et al. 17 applied a feed-forward network to estimate the curvature angles of a pneumatic actuator. Thuruthel et al. 18 demonstrated the significant potentials of deep neural networks as an empirical model to estimate the tip position and contact force of a soft finger. Wang et al. 19 paired a convolutional neural network with an autoencoder to reconstruct the three-dimensional (3D) shape of soft bodies under deformation. However, these data-driven approaches assume the availability of high-quality data representative of the soft robots. 20
Unfortunately, data collection is often time-consuming as the soft robot would need to be exhaustively actuated to sufficiently cover its task space.21,22 Furthermore, for some potential soft robot applications such as underwater exploration or surgical robotics, data collection can be expensive and limited. For instance, in underwater grasping 23 or exploration, 24 it can be tedious and costly to send the robot underwater to collect data for behavioral modeling. 25 Likewise, collecting medical soft robot data poses its own challenges due to legal and ethical concerns. 26 Evidently, data-driven models become difficult to implement when there are insufficient data. Models trained on structurally incomplete data fall short of capturing the nonlinear, full-range dynamics of complex soft robots (underfitting). 21 Conversely, models could also fit abnormally well to the limited training data (overfitting). 27
Existing approaches alleviate this requirement of voluminous data using simple or nonparametric models. For example, regression analysis is used in Elgeneidy et al. 17 for bending angle prediction. Similarly, a local Gaussian process is utilized in Fang et al. 28 for online control. These methods rely less on data availability and thus are less prone to overfitting, at the cost of having less expressive models. An alternative approach is to use empirical simulation models or environments. For example, Runge et al. 29 and Massari et al. 30 generated data for neural network training via the finite element method (FEM). Nevertheless, both data-driven and FEM approaches lack generalization ability in addressing factors such as altering material properties, inconsistent fabrication, and changes in environment. A slight variation in these factors could result in substantial differences between the developed models and the actual soft robots. 29 Therefore, data would have to be collected regularly to fine-tune these models via a costly and laborious data collection process.
To generalize empirical models trained on simulated environments to real-world scenarios, recent works incorporate sim-to-real transfer strategies to reduce the sim-to-real gap31,32 in FEM-simulated data for soft robot design, 33 modeling and control,33–35 and state estimation.36,37 Alternatively, previous studies38–40 employ domain randomization, whereby the dynamics of a simulated environment are randomly changed, thus allowing a trained model to generalize better to unseen real-world scenarios. These methods attempt to bridge the sim-to-real gap by modeling discrepancies between the simulated and real-world domains as variability in the simulated domain. Besides this, the works41,42 have used differentiable simulators to generate data for data-driven models.
These models provide an avenue for reducing the sim-to-real gap by developing efficient, gradient-based optimization algorithms to find the simulation parameters that best fit the observed sensor readings. However, these analytical models can only predict the dynamical behavior of systems for which they have been designed. Apart from that, the underlying mathematical formulation of high-fidelity simulation models on highly complex environments is time-consuming to develop, due to the nonlinear characteristics of soft robots.43,44 Taking this into consideration, data-driven frameworks offer a straightforward alternative to model nonlinear, complex soft robot behaviors and interactions by omitting the mathematical rigor of simulation models, although at the expense of model explainability.
Apart from the aforementioned techniques, there are also methods that generate synthetic data by extrapolating from the original data. For example, data augmentation applies random transformations to the original training data, such as adding random noise (also known as jittering), 45 scaling, 46 rotation or flipping, 47 and time-warping. 46 Several works have adopted data augmentation techniques in the robotics field. For instance, Su et al. 48 used data augmentation techniques to improve image segmentation in agricultural robots, and Liu and Li 49 used perspective transformation to expand existing data sets for estimating the motion of an eye-in-hand robotic system. Similarly, data augmentation can also be applied to time series robotic data.
For example, Wang and Majewicz Fey 50 and Lakomkin et al. 51 used data augmentation techniques to prevent overfitting on their inference models for robot-assisted surgery and human–robot speech interaction, respectively. However, the above-mentioned methods rely on random transformations of the original time series, which might deviate from the true underlying distribution of the soft robots. In addition, these transformations are applicable to only specific data sets (i.e., vision data). Alternatively, meta-learning is a machine learning technique that attempts to learn from a small set of data by exploiting previously learned knowledge. However, meta-learning in robotics is primarily based on vision data, which may not be directly applicable for soft robots with complex dynamics in a time series domain. 21
Recently, generative neural network models have gained immense traction due to their ability to accurately transfer from a random distribution (typically uniform noise) to a desired distribution, allowing explicit generation from the true underlying data distribution. 52 In addition, these models are able to extrapolate beyond the distribution of the original data, which can potentially resolve the sim-to-real gap. 53 Generative adversarial nets (GANs) 54 are a class of deep generative models capable of generating realistic synthetic data by learning the dynamics of the original data through an adversarial learning process.
Several works have employed GAN models for domain adaptability by transferring simulated images to realistic ones to reduce the real-to-sim gap in rigid robots for tasks, such as grasping55–57 and object interaction. 57 In contrast, for our study, we exploit the generative capability of GANs in expanding the volume of a limited time series data set to improve the training of deep neural networks for soft robot application,58,59 at a reduced effort and cost of data collection. 60 In addition, GANs provide a data-driven framework for generating synthetic data, thus extricating dependency on the expert knowledge required to construct high-fidelity analytical and FEM models.
Commonly, data from the domain of soft robots are composed of real-valued time series readings sourced from sensors, such as flex sensors,17,18 pressure readings,20,61 and force magnitudes.62,63 Therefore, it is imperative that the synthetic samples generated using GANs resemble the nonlinear and multivariate time series of soft robotic data, that is, time series that will most likely be nonlinear and consist of more than one time-dependent variable. 18 Recently, a few pioneering works have successfully adapted GANs for the time series domain.64–66 However, these works were not specifically designed to address the complexities of the soft robotics domain.
To this end, we propose a novel time series GAN model, Transformer-TimeGAN (TTGAN), for realistic soft robotic data generation. Concretely, we aim to capture the nonlinear multivariate properties of the data by proposing a GAN architecture for handling time series, integrated with self-attention networks to accurately model the possible temporal dependencies in the data.66,67 We then show that our proposed GAN model can be used to compensate for the large data requirement in deploying data-driven models for soft robotics. We test our proposed methodology on a widely used soft robotic platform—a pneumatic soft gripper (PSG).15,20,68 This platform facilitates the collection of a full data set needed for adequate result validation while also being sufficiently representative of a complex soft robotic time series. In detail, we collect a soft robotic data set composed of pressure, flex, force, and motion variables sourced from sensors within our pneumatic gripper platform. We then use the proposed GAN model to generate synthetic samples of data, which we combine with a small amount of original data to train a data-driven model for a multimodal sensing task.
Furthermore, considering that soft robots intrinsically have high behavioral diversity due to their soft and compliant body,69–71 it would be beneficial to isolate data generation to select desired behaviors for more accurate representation learning. Therefore, we introduce a conditioning vector as in Dai et al. 72 to our proposed architecture producing Conditional TTGAN (CTTGAN), enabling data generation of specific robot behaviors based on the corresponding conditional vector.
To the best of our knowledge, this is the first work that investigates the feasibility of GANs for synthetic data generation, as an alternative to the costly and laborious data collection process in soft robotics. In summary, our contributions in this article are highlighted as follows:
We propose a novel TTGAN, a time series generative model for synthetic data generation. Here, we forgo the commonly used autoregressive networks and opt instead for attention networks to capture the nonlinear and multivariate dynamics of soft robotic data in a more succinct latent space to aid the learning of the GAN network. Experimental results show that our model is able to better capture the dynamics of soft robotic data and generate more realistic time series compared with the relevant state of the art.
We adopt the conditional framework into our model, namely CTTGAN, by introducing conditional labeled vectors when training and generating data. Conditional vectors afford control over the distribution of the synthetic data generated, which allows data generation of specific robot behaviors. Experimental results show that the CTTGAN model is able to complement data sets skewed to specific soft robot behavior.
To demonstrate the validity of our proposed approach, we design, fabricate, and collect data on three different soft robot platforms. Experimental results show that by combining synthetic data with a small subset of original data, we achieved comparable accuracy to that of a model trained using full-length original data on all three soft robot platforms.
The results presented in this work fundamentally depict a viable alternative to costly and exhaustive data collection required for data-driven methods in soft robotics.
Materials
Design and fabrication
We use a popular class of soft actuators known as PneuNet 68 to design and develop a pneumatic-based soft gripper used in our experiments. Our PSG is composed of three individual pneumatic soft fingers (PSFs) held onto a 3D-printed holder, with the PSFs spaced 120° apart from one another, as shown in the top left box in Figure 1. The PSFs are constructed following identical fabrication processes and are composed of two sections. The first section consists of a main body made up of a series of channels and chambers. The second section consists of a base layer embedded with a flex sensor (4.5″; SparkFun) and an inelastic material (paper) to make the layer inextensible. Both the main body and the base layer of the PSF are formed from silicone (EcoFlex 0050; Smooth-On, Inc.) using 3D-printed molds.

Experimental setup for PSG data collection. Motion capture cameras are focused on the PSG platform to capture its movement by tracking the reflective markers placed along the body of each of the PSFs. Red arrows on the table show the three axes (X, Y, and Z) of the resultant forces. Shown in the top left, the D-D plane shows the top view of the PSG relative to the contact bulb. Blue arrows indicate two-axis (X, Z) PSF contact forces exerted by a single PSF, whereas red arrows indicate the three-axis (X, Y, Z) resultant grasping forces exerted by the compounded PSG. Data are collected for two soft robotic behaviors under two actuation patterns. In total, four scenarios were captured as shown on the right. PSF, pneumatic soft fingers; PSG, pneumatic soft gripper.
The connections of the flex sensor are soldered, and the PSF is fixed with a sharp-end pneumatic pipe as an air inlet. Finally, the base layer and the main body are joined together using additional silicone. Although the fabrication processes are identical, the act of manually replicating the PSFs along with the use of soft materials resulted in slight variations such as asymmetrical air columns. In addition, the flex sensor embedded in each of the PSFs exhibited inconsistent (out-of-distribution) sensor dynamics, resulting in different sensor responses to the same PSF bending configuration. Such variability is not uncommon as the task of achieving consistent dynamics for sensors would typically require a precise automated fabrication process in a highly controlled environment, which may not be economically practical. Nevertheless, with sufficient data, a data-driven approach in modeling the gripper would be able to account for inconsistent dynamics of the individual PSFs.
Experimental procedure
An electropneumatic regulator (ITV1030; SMC Corporation) connected to a pneumatic supply is used to modulate the pressure of our pneumatic system using pulse width modulation (PWM) signals. 73 The modulated pressure is supplied to each PSF inlet and a pressure sensor (MPXH6400A; NXP) in parallel, where the pressure inside the PSFs is assumed to be the same as the measured pressure. The PWM signal is controlled using a microcontroller (PSoC®5LP) that manipulates the duty cycle (control input) of the PWM signal.
Motion of the PSG is captured by five cameras (OptiTrack Flex13; NaturalPoint, Inc.) that continuously measure the coordinates of 10 reflective ball-markers placed along the inextensible base layer of each PSF. Furthermore, a sensing circuit is used to convert the change in resistance of the embedded flex sensor to voltage readings. For simplicity, the sensor values are collected directly without any filtering or amplification. Contact and grasping force measurements are collected using a multi-axis load cell (Axia80; ATI Industrial Automation, Inc.) with a contact bulb attached on top of it. Here, we consider the forces applied on the X, Y, and Z axes of the load cell, as shown in Figure 1.
The PSG is actuated using a series of pseudorandom pressure levels, within a predetermined pressure range. Two different actuation patterns, namely Oscillatory actuation and Random actuation, are used to verify the generality of our approach to different input signals. In the former, two types of input voltage signals are fed to the electropneumatic regulator to generate a gradual oscillatory pattern, and each PSF is actuated with the same input pressure. In the latter, a faster random pressure actuation pattern is used, and each PSF is independently actuated with random input pressure.
For each actuation pattern, data were collected for two experimental behaviors. In the first behavior, the PSG is actuated without any obstructions in its task space (Free Bending), whereas in the second behavior, the PSG comes in contact with the contact bulb near its fingertips to mimic surface contact (Tip Contact). The contact bulb is placed in the center of the three PSFs to measure the resultant grasping force of the PSG.
Data collection
Accounting for the two actuation patterns and experimental behaviors, data were collected for a total of four scenarios, namely Oscillatory Free Bending, Oscillatory Tip Contact, Random Free Bending, and Random Tip Contact, as depicted in Figure 1 (right). Each data sample contains measurements for the actuation pressure and flex sensor voltage for the three PSFs. Furthermore, the 10 markers placed on each PSF are recorded as 3D Cartesian coordinates to capture the full motion of the PSG (30 markers in total). Finally, contact force is captured as the resultant force applied by the PSG along its X, Y, and Z axes, totaling 99 measurements per sample. Data were collected at a rate of 10 Hz using MATLAB. 18 We partitioned the data into training (9000 samples per scenario) and testing sets (3000 samples per scenario).
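The composition of a single sample can be checked with a short sketch. The ordering of the feature groups within the vector and the chronological train/test split are assumptions made for illustration; the text specifies only the group sizes and the number of samples per set. The names LAYOUT, feature_slices, and split_scenario are hypothetical.

```python
# Hypothetical layout of one data sample. The group sizes follow the text;
# the ordering of the groups inside the vector is an assumption.
N_FINGERS = 3
MARKERS_PER_FINGER = 10
AXES = 3

LAYOUT = {
    "pressure": N_FINGERS,                              # one reading per PSF
    "flex": N_FINGERS,                                  # one voltage per PSF
    "markers": N_FINGERS * MARKERS_PER_FINGER * AXES,   # 30 markers x (X, Y, Z)
    "force": AXES,                                      # resultant force on X, Y, Z
}


def feature_slices(layout):
    """Map each feature group to its slice within a flat sample vector."""
    slices, start = {}, 0
    for name, width in layout.items():
        slices[name] = slice(start, start + width)
        start += width
    return slices, start


SLICES, SAMPLE_WIDTH = feature_slices(LAYOUT)


def split_scenario(samples, n_train=9000, n_test=3000):
    """Split one scenario into training and testing sets.

    Assumption: a chronological split. The text gives the set sizes
    but not how the samples were assigned to each set.
    """
    if len(samples) < n_train + n_test:
        raise ValueError("not enough samples for the requested split")
    return samples[:n_train], samples[n_train:n_train + n_test]
```

Under these assumptions, SAMPLE_WIDTH evaluates to 99, matching the count stated above. If each scenario was recorded continuously, 12,000 samples at 10 Hz correspond to 20 minutes of data per scenario.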
Methods
Synthetic data generation
Figure 2 illustrates the proposed architectures, training, and data generation process. To generate realistic synthetic data using the proposed TTGAN model, we first describe how we prepare our data to be fed into the model. Let Xt be a vector of features (flex, pressure, force, position markers) that makes up the sequence

Architecture diagram for the training and usage of the proposed TTGAN and CTTGAN generative models. The numbers inside the brackets shown in the figure correspond to the equations explained in the text. TTGAN, Transformer TimeGAN; CTTGAN, Conditional Transformer TimeGAN.
We describe how each component of the network is used and trained as follows. The autoencoder network is composed of an encoder and recovery network, which are used to provide a reversible mapping between feature and latent spaces. The former maps the input, xt, to its corresponding latent code, ht, whereas the latter works in the opposite direction, mapping the latent codes back to their feature representation,
These are shown formally in Equations (1) and (2) where both E and R are parameterized by Transformer encoders, and PE are the position embeddings proposed in Vaswani et al. 67
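As an illustration of the position embeddings, the sketch below implements the fixed sinusoidal encoding described by Vaswani et al. and adds it to an input sequence. Whether the model uses this fixed form or a learned variant is not stated here, so the choice is an assumption, and the function names are hypothetical.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal position embeddings (fixed, not learned).

    Even feature indices use sine and odd indices use cosine, with
    wavelengths forming a geometric progression controlled by 10000.
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def add_positions(sequence):
    """Add position embeddings to a sequence of equal-length feature vectors."""
    pe = positional_encoding(len(sequence), len(sequence[0]))
    return [
        [value + offset for value, offset in zip(step, pe_step)]
        for step, pe_step in zip(sequence, pe)
    ]
```

Because a Transformer encoder has no recurrence, an encoding of this kind is what lets it distinguish time step t from time step t+1 in the sensor sequence.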
Overall, the autoencoder network aims to provide accurate reconstructions,
The GAN network is composed of a Generator, G, and Discriminator, D, which operate within the latent space produced by the autoencoder network. G is fed a random noise vector sampled independently from a uniform distribution,
Conversely, the Discriminator, D, is fed with real or synthetic data points and attempts to provide a binary classification feedback.
Both G and D are parameterized by a 3-layered long short-term memory (LSTM) network, each with 24 hidden dimensions. To train G and D, the GAN framework is utilized where the networks are trained with differing objectives of the same loss. This loss,
We also used the supervised loss proposed in Yoon et al. 66 to aid the model in synchronizing the latent dynamics of the real and synthetic data. Using maximum likelihoods, this yields the supervised loss,
Evaluation of synthetic data
To quantitatively analyze the generated data, we measure how the synthetic data affects the representational learning of an empirical model used to estimate perceptive variables of our soft robotic system, that is, a multimodal sensing task. We devise two experiments that reflect realistic scenarios, which are referred to as Experiments A and B. In both experiments, we use LSTM networks as the backbone of our inference model. 74 Note that this backbone is model-agnostic, and any suitable time series network can be used to parameterize the inference model. In this work, we use a three-layered LSTM model with each layer having
Using flex and pressure readings as the input sequence, the inference model is used to estimate the target sequence composed of the (1) bending state of the PSG via the 30 position markers,
The proposed TTGAN model can be useful in conditions where data collection is limited (i.e., due to physical, cost, or time constraints) and can be used to uniformly increase the volume and variety of a data set. We verify this through Experiment A, as also explained in Figure 3. We simulate this condition by arbitrarily removing half of the collected training data from each of the collected scenarios (resulting in the Half data set). We train the TTGAN model using only the Half data set and produce an equal amount of synthetic data. We also generate synthetic data using popular time series data augmentation techniques, 53 such as jittering, scaling, rotation, time warping, 46 random guided warping, 75 and suboptimal warped time series generator (Spawner), 76 as a fair comparison to the proposed method.
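For reference, the two simplest augmentation baselines named above can be sketched as follows. This is a minimal illustration, assuming Gaussian noise for jittering and one random factor per channel for scaling; the default sigma values are placeholders rather than the settings used in the comparison.

```python
import random


def jitter(series, sigma=0.03, rng=random):
    """Jittering: add independent Gaussian noise to every value.

    `series` is a list of time steps, each a list of channel values.
    """
    return [[value + rng.gauss(0.0, sigma) for value in step] for step in series]


def scale(series, sigma=0.1, rng=random):
    """Scaling: multiply each channel by one random factor drawn around 1.

    The same factor is applied to a channel at every time step, so the
    shape of each channel is preserved while its magnitude changes.
    """
    factors = [rng.gauss(1.0, sigma) for _ in series[0]]
    return [[value * factor for value, factor in zip(step, factors)] for step in series]
```

Both transformations perturb existing sequences rather than sampling new ones, which is why they can drift away from the true sensor dynamics when the noise level or scaling factor is poorly chosen.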

Experimental scenarios investigated. In Experiment A, each scenario as well as the concatenation of all the scenarios (Overall) are alternatively used as the Complete data set. In contrast, Experiment B only uses Overall as the Complete data set. Testing for both experiments follows the same methodology, whereby the trained inference models are evaluated using an unseen test set.
The inference model is then individually trained on four sets of data, namely the full training data (Complete), the halved training data (Half), the TTGAN-generated synthetic-only data (Synthetic), and a mixture of the halved training data with the generated synthetic data (Mixed). Once trained, we expose the inference model to the test set for the first time and compute the root-mean-square error (RMSE) by comparing the estimated values with the ground truth. We then normalize the RMSE by dividing it by the range between the maximum and minimum values, producing the normalized root-mean-square error (NRMSE), which we use to quantitatively compare the information learned by the inference model trained on the different data sets.
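The metric can be written out for a single estimated variable as below. The sketch assumes that the normalization range is taken from the ground-truth values; the text states only that the maximum and minimum values are used. The function names are hypothetical.

```python
import math


def rmse(estimates, targets):
    """Root-mean-square error between estimated and ground-truth values."""
    if len(estimates) != len(targets) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((e - t) ** 2 for e, t in zip(estimates, targets))
    return math.sqrt(squared / len(targets))


def nrmse(estimates, targets):
    """RMSE divided by the range of the ground truth (assumed normalization)."""
    value_range = max(targets) - min(targets)
    if value_range == 0:
        raise ValueError("ground truth has zero range")
    return rmse(estimates, targets) / value_range
```

Normalizing by the range makes errors comparable across variables with different units, such as force and marker position.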
However, uniformly removing data from each scenario might not reflect a real-world context, where the available data are often skewed toward particular behaviors. Such skewed conditions are prevalent when data collection is costly for certain behaviors but inexpensive for others. 25 Similarly, instrument malfunction or software errors midway through data collection can lead to skewed data sets. Therefore, we investigate this condition using Experiment B, as also shown in Figure 3. Here, we skew our data set by randomly removing half of our training data from a target behavior (i.e., either Free Bending or Tip Contact) while keeping the other behavior intact. In contrast to the vanilla TTGAN, the CTTGAN model is able to dictate the distribution of data generated to match only the removed behavior, hence balancing the data set.
To evaluate the CTTGAN model, we compute the NRMSE of the inference model after training on four data sets: Complete—as above, the skewed data set with a target behavior removed (Skewed), a mixture of the Skewed data set and synthetic data generated with CTTGAN (Mixed-CTTGAN), and a mixture of Skewed and synthetic data generated with TTGAN (Mixed-TTGAN). Both the TTGAN and CTTGAN models are trained using only the Skewed data set. The following section analyzes the quality and usability of the synthetic data on the PSG platform. To further ascertain the adaptability of the proposed method, we included additional experiments in the Supplementary Section to first analyze the model's performance on more nonlinear behaviors using a single PSF; and second, to carry out a similar analysis on a multichannel pneumatic soft continuum (PSC) body capable of nonplanar bending.
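The construction of the Skewed data set, and the amount of conditional synthetic data needed to rebalance it, can be sketched as follows. The representation of samples as (behavior label, sequence) pairs, the function names, and the fixed seed are illustrative assumptions.

```python
import random
from collections import Counter


def skew_dataset(samples, target_behavior, keep_fraction=0.5, seed=0):
    """Randomly keep only a fraction of the samples of one behavior.

    `samples` is a list of (behavior_label, sequence) pairs; samples of
    every other behavior are left intact.
    """
    rng = random.Random(seed)
    target = [s for s in samples if s[0] == target_behavior]
    others = [s for s in samples if s[0] != target_behavior]
    kept = rng.sample(target, int(len(target) * keep_fraction))
    return others + kept


def deficit(skewed, target_behavior, reference_behavior):
    """Number of conditional synthetic samples needed to rebalance."""
    counts = Counter(label for label, _ in skewed)
    return max(0, counts[reference_behavior] - counts[target_behavior])
```

A conditional generator would then be asked for exactly this number of sequences with the condition set to the target behavior, whereas an unconditional generator would spread its output across both behaviors.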
Results and Discussion
Comparison of synthetic and original data
In this subsection, we analyze the quality of the synthetic data with respect to the original data. The analysis here focuses on the ability of TTGAN to capture the relationship between the input (i.e., flex and pressure) and output (i.e., force and position markers) variables for each of the scenarios collected. We also analyze the usability of the synthetic data by calculating a predictive score for each synthetic data set.
First, we carried out a t-distributed stochastic neighbor embedding (t-SNE) analysis on the original and synthetic data set. t-SNE is a widely used dimensionality reduction technique for visualizing high-dimensional data in a low-dimensional space. This assessment depicts how closely the distribution of synthetic samples matches that of the original samples in a two-dimensional space and is used as qualitative analysis. We run the t-SNE analysis for 300 iterations, setting the perplexity to 40. From Figure 4, we can observe that the generated data set bears a striking resemblance (in a lower dimensional space) to the original, based on their similarly evaluated t-SNE plots for all the scenarios. This result implies that the synthetic data generated are of high quality, in the sense that they are semantically similar to the original. Critically, this shows that the generative model is able to successfully capture the relationship between the input and output variables.

t-SNE visualization of the distribution of synthetic and original data variables (i.e., flex, pressure, force, and position markers) for each scenario is shown on the left where red dots denote original data and blue denotes synthetic data. On the right, we reconstruct the closest synthetic and original t-SNE variables back to the feature space using the k-means nearest-neighbor algorithm. Trajectories of each PSF (across time) are depicted as moving lines on the reconstructed position markers. A video demonstrating the real-time comparison of the original and synthetic data is provided as a Supplementary Movie S1. t-SNE, t-distributed stochastic neighbor embedding.
To determine the usability of the synthetic data, we measure how well the generated data capture the predictive characteristics of the original data by calculating a predictive score. 20 This is calculated by using the synthetic data set to train a post hoc sequence-prediction model, which predicts the next-step temporal vectors over each input sequence. This trained model is then evaluated on the original data set, producing the predictive score in terms of a mean absolute error (lower is better). The predictive scores computed for each of the synthetic data sets are shown in Table 1. The low predictive scores produced by the proposed TTGAN further exemplify the ability of the generative model to produce a usable synthetic data set for representational learning. Apart from that, we observe that TTGAN consistently generates synthetic data sets with better (lower) predictive scores than the baseline TimeGAN (i.e., a well-known GAN framework for time series as proposed in Yoon et al. 66 ) across all scenarios.
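The train-on-synthetic, test-on-original logic of the predictive score can be illustrated with a deliberately simple stand-in. The sketch below replaces the post hoc sequence-prediction network with a univariate least-squares one-step predictor, which is an assumption made purely to keep the example self-contained; only the evaluation protocol mirrors the description above.

```python
def fit_one_step(series_list):
    """Fit x[t+1] = slope * x[t] + intercept by least squares.

    Stand-in for the post hoc sequence-prediction model; each series is a
    list of scalar values.
    """
    xs, ys = [], []
    for series in series_list:
        xs.extend(series[:-1])
        ys.extend(series[1:])
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        slope = 0.0
    else:
        slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x


def predictive_score(synthetic, original):
    """Train on synthetic data, report mean absolute error on original data."""
    slope, intercept = fit_one_step(synthetic)
    errors = [
        abs(slope * x + intercept - y)
        for series in original
        for x, y in zip(series[:-1], series[1:])
    ]
    return sum(errors) / len(errors)
```

A score near zero indicates that a predictor trained only on synthetic sequences transfers to the original sequences, which is the property the predictive score is meant to capture.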
Mean Absolute Error Comparison of Predictive Scores of TimeGAN Versus Transformer TimeGAN
Bold text indicates best performance.
TTGAN, Transformer TimeGAN.
To understand why TTGAN is able to produce synthetic data sets with better predictive scores, Figure 5 illustrates the t-SNE plots of the latent space produced by the autoencoder network of each model. We note that from the original distribution shown in Figure 5 (top), the Tip Contact scenarios have similar distributions and are clustered together in all three t-SNE plots. Both TimeGAN and TTGAN are able to cluster and segment Random Free Bending from other scenarios. However, while the TTGAN is able to cluster and segment the Oscillatory Free Bending scenario, TimeGAN fails to make a clear segregation. Evidently, the autoencoder in TTGAN is better equipped to capture the dynamics of the different scenarios, which allows the GAN network to learn the underlying dynamics of each scenario in greater detail.

t-SNE visualization of the latent space produced by the encoder component of TimeGAN (middle) and TTGAN (bottom) after training. Colored circles on the right represent the corresponding actuation state space of the PSG for each scenario. Note that Oscillatory Tip Contact and Random Tip Contact share the same state space.
Significance of synthetic data in compensating for the lack of original data
We analyze the significance of the synthetic data based on the results of Experiment A, as shown in Table 2. Focusing on the Overall scenario, we note that although training on Complete provides the lowest NRMSE, we are able to get comparable performance, with a difference of only −0.311% for Force NRMSE and −0.062% for Position Markers NRMSE when training on the Mixed data set. In addition, we note that the inference model trained on the Mixed data set with synthetic data generated by TTGAN produces a significantly lower NRMSE than all the data augmentation comparison data sets. Likewise, the Mixed data set performs 2.271%/0.009% and 6.076%/1.003% better than the Half and Synthetic data sets for Force/Marker estimations. This implies that the synthetic data set is able to add useful diversity and volume to the halved original data set.
Comparison of Training Data Set and Corresponding Normalized Root-Mean-Square Error
The inference model is independently trained on four data sets of different contexts, namely Complete, Half, Synthetic, and Mixed, as explained above. Time series estimation is performed for each of the scenarios collected as well as Overall where all scenarios are concatenated and estimated serially. The NRMSE results depicted are the result of averaging over five independent runs.
RGW, random guided warping.
To further analyze the performance of the synthetic data set for representational learning, Figure 6 illustrates the time series estimation of resultant force, and Figure 7 compares the position markers against the ground truth. These figures show that the inference model suffers from overfitting when trained on the Half and Synthetic data sets. This is most pronounced in the Free Bending behavior, where the model predicts high force values even when there is no contact. We attribute this to the lack of data samples in these data sets, which weakens model generalization on the unseen test set.

Time series estimation of resultant force by the inference model compared with the ground truth when individually trained on each of the different training data sets—Complete, Half, Synthetic, and Mixed.

Comparison of reconstructed (ground truth) and estimated position markers (left) by the inference model when trained on the different data sets for each of the collected scenarios. The corresponding real gripper is shown on the right. Furthermore, the estimated and real trajectories of the tip marker coordinates (across time) are also illustrated. A video demonstrating the real-time reconstruction of the time series is provided as a Supplementary Movie S1.
In comparison, the Mixed data set circumvents this problem due to its higher volume of data, achieving performance comparable to the Complete data set and the ground truth. We also confirm this by comparing the training and validation loss of the inference model when trained on the different data sets. As shown in Supplementary Figure S5, the Half and Synthetic data sets either suffer from overfitting, where the training loss is low but the validation loss is high, or underfitting, where the training loss does not converge (i.e., for Force when trained on Half).
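The loss-curve criteria described above can be expressed as a simple check. The sketch below is a heuristic written for illustration only: the function name, the window length, and both thresholds are assumptions and are not values used in the article.

```python
def diagnose_fit(train_loss, val_loss, window=5, converge_tol=0.05, gap_ratio=2.0):
    """Label a pair of loss curves as 'underfitting', 'overfitting', or 'ok'.

    Hypothetical thresholds:
    - underfitting: the training loss has not converged, i.e. its relative
      spread over the last `window` epochs exceeds `converge_tol`;
    - overfitting: the training loss has converged but the final validation
      loss exceeds `gap_ratio` times the final training loss.
    """
    if len(train_loss) < window or not val_loss:
        raise ValueError("not enough epochs to diagnose")
    tail = train_loss[-window:]
    tail_mean = sum(tail) / window
    spread = (max(tail) - min(tail)) / tail_mean if tail_mean > 0 else 0.0
    if spread > converge_tol:
        return "underfitting"
    if val_loss[-1] > gap_ratio * train_loss[-1]:
        return "overfitting"
    return "ok"


# Toy curves for illustration only (not losses from the article).
print(diagnose_fit([1.0, 0.5, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1],
                   [1.0, 0.6, 0.4, 0.4, 0.5, 0.6, 0.7, 0.8]))
```

In practice, such thresholds would need to be tuned to the scale of the loss and inspected alongside the plotted curves rather than used on their own.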
To verify the generality of our method, this experiment is repeated with smaller amounts of original data to train the generative model and the inference model. We plot the performance of the inference model when trained on five data sets of varying size, namely:

NRMSE of the inference model when trained on varying amounts of training samples. The x axis represents the amount of original data used to train the inference model and the generative model. The dashed red horizontal line shows the performance of the model when trained on the full original data set and serves as a lower bound. NRMSE, normalized root-mean-square error.
For Synth 1, Synth 2, and Compensated, we use the same TTGAN model, trained only on the Original data set, to generate the synthetic data. The full plot for each of the four collected scenarios is shown in Figure 8. Using the Complete data set as a lower bound on NRMSE (i.e., the best performance), we can determine when the collection of original data can stop. Although this point generally differs for each scenario, in all cases the Compensated data set (blue line in Fig. 8) achieves performance comparable to the Complete data set (dashed red line in Fig. 8), within −2.5%/−1% NRMSE for Force/Marker estimations, with only 12.5% (1125 data samples) of the original data in Complete. These results suggest that the synthetic data consistently compensate for the lack of original data. More importantly, they demonstrate the feasibility of using the proposed generative model to produce synthetic data that compensate for the lack of original data in soft robotics.
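The stopping rule implied by Figure 8 can be sketched as follows: given the NRMSE of the Compensated data set at several fractions of original data and the NRMSE obtained with Complete, return the smallest fraction whose gap to Complete is within a chosen tolerance. The function name, the tolerance, and the toy values are assumptions for illustration; they are not results from the article.

```python
def smallest_sufficient_fraction(compensated_nrmse, complete_nrmse, tolerance):
    """Return the smallest fraction of original data whose Compensated NRMSE
    is within `tolerance` (in NRMSE percentage points) of the Complete NRMSE,
    or None if no fraction qualifies.

    `compensated_nrmse` maps fraction of original data -> NRMSE in percent.
    """
    for fraction in sorted(compensated_nrmse):
        gap = compensated_nrmse[fraction] - complete_nrmse
        if gap <= tolerance:
            return fraction
    return None


# Toy values for illustration only (not results from the article).
toy_curve = {0.125: 6.0, 0.25: 4.9, 0.5: 4.2}
print(smallest_sufficient_fraction(toy_curve, complete_nrmse=4.0, tolerance=1.0))
```

A stricter variant could additionally require every larger fraction to satisfy the tolerance, which guards against a single noisy point on the curve.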
Addressing skewed data sets with CTTGAN
In Experiment B, we analyze the impact of a skewed data set and the performance of the proposed CTTGAN model based on the results tabulated in Table 3. We note that the inference model trained with the Mixed-CTTGAN data set performs comparably to the model trained on the Complete data set, with a difference of only −0.01% in Force NRMSE and −0.006% in Marker NRMSE. It also performs 1.009% and 0.156% better for Force and Marker estimation, respectively, than the Skewed data set. In addition, Figure 9 illustrates the data distribution between the complete and synthetic data sets for both the CTTGAN and TTGAN models.

Data distribution of generated synthetic data where
Normalized Root-Mean-Square Error of Transformer TimeGAN Versus Conditional Transformer TimeGAN on Skewed Data Set
The inference model is independently trained on four data sets where each data set only contains 25% of the removed behavior (either Free Bending or Tip Contact), except for the Complete data set. Results are evaluated for the Overall scenario and have been averaged over five independent runs.
CTTGAN, conditional TTGAN.
From Figure 9a, we note that the model generates data randomly within the data distributions of all the experimental behaviors, whereas Figure 9b shows that we are able to direct the generation of data toward a specific data distribution, in this case the Free Bending behavior (i.e., Oscillatory Free Bending and Random Free Bending). We therefore conclude that CTTGAN attends better to the target behavior (i.e., Free Bending). This illustrates that CTTGAN is beneficial when the data set is skewed and recollecting data for the underrepresented behavior is not an option.
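To make the rebalancing idea concrete, the sketch below computes how many synthetic samples one might request per behavior from a conditional generator so that every behavior reaches the same count. It only computes the quota; the conditional generator itself is not reproduced. The function name, the behavior labels, and the sample counts are hypothetical.

```python
from collections import Counter


def synthetic_quota(labels, target=None):
    """Number of synthetic samples to request per behavior label.

    `labels` holds one behavior label per original sample. By default every
    behavior is topped up to the size of the largest one (an assumption of
    this sketch, not a rule stated in the article).
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("no samples given")
    if target is None:
        target = max(counts.values())
    return {behavior: max(0, target - n) for behavior, n in counts.items()}


# Toy skewed data set for illustration only: one behavior is underrepresented.
toy_labels = ["tip_contact"] * 8 + ["free_bending"] * 2
print(synthetic_quota(toy_labels))
```

The resulting quota would be passed as the conditioning target when sampling, so that generation concentrates on the underrepresented behavior instead of being spread across all behaviors.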
Limitations and future works
An error analysis is carried out by examining the prediction error plots of the inference model trained on the Synthetic and Half data sets, as shown in Figure 10. Although the errors are generally small, we note that the poorer performance of the synthetic data is driven largely by specific components of the soft robot motion (i.e., when it is fully actuated in Tip Contact). This is to be expected, as TTGAN aims to learn and generalize from the original data distribution. As a result, the model generates data close to the mean with high likelihood, and extreme points such as full actuation are less likely to be generated. In future work, we aim to promote greater variety in the generated data by encouraging the model to produce more out-of-distribution samples.

Prediction error plot over time. Extreme points (the model's weakest predictions) are circled and labeled based on the extent of the PSG actuation at that time.
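One simple way to locate such extreme points automatically is to flag time steps whose absolute error exceeds the mean error by a multiple of its standard deviation. This is a sketch under assumed settings: the threshold rule, the function name, and the toy signal are not taken from the article, which identifies the points from the error plot.

```python
from statistics import mean, pstdev


def extreme_error_indices(y_true, y_pred, k=2.0):
    """Indices where |error| exceeds mean(|error|) + k * std(|error|).

    The mean-plus-k-sigma rule and the default k are assumptions.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("sequences must be non-empty and of equal length")
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    threshold = mean(errors) + k * pstdev(errors)
    return [i for i, e in enumerate(errors) if e > threshold]


# Toy signal for illustration only: one time step has a large error.
toy_truth = [0.0] * 10
toy_pred = [0.1] * 6 + [2.0] + [0.1] * 3
print(extreme_error_indices(toy_truth, toy_pred))
```

The flagged indices could then be matched against the actuation state at those time steps to check whether they coincide with full actuation.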
In addition, we could exploit more efficient Transformer architectures such as sparse Transformers 77 to improve the efficiency of the generative model. Regarding the experiments, we had planned to apply our framework to a more complex multisegment soft robot composed of a series of identical PSC bodies, in which each PneuNet channel is actuated independently. However, our current pneumatic control setup can concurrently control the pressures of no more than three PneuNet channels. We could also explore 3D printing a soft strain sensor grid 78 directly into the PSC as an alternative to the current laborious soft sensor fabrication process described in Supplementary Section S.2.1. Furthermore, the current load cell, which is designed for point load measurements, has prevented us from modeling the contact force distribution applied on the PSF. Inspired by state-of-the-art physics-informed neural networks 79 and sim-to-real transfer learning, 34 there are new opportunities to model the contact force distribution via hybrid models that leverage both the explainability of rigorous analytical or FEM methods and the flexibility of deep learning frameworks.
Conclusions
This study is the first to propose a data-driven framework to capture the complex dynamics of soft robots aided by synthetic data. In a multimodal sensing task, the proposed TTGAN model successfully generates high-quality synthetic data that can be combined with just half of an original sequence to train a data-driven inference model and obtain performance comparable to using the entire original sequence. This is impactful because it implies that costly data collection can be reduced to gathering a smaller set of data and compensating with synthetic data. In addition, by utilizing CTTGAN, we are able to control the distribution of the generated synthetic data to focus on a specific behavior of a soft robot, which also helps in the case of skewed data sets. The outcomes of this research make an important contribution toward enabling complex, data-driven soft robot modeling without relying on huge amounts of data.
Footnotes
Acknowledgments
The authors acknowledge the resources provided by the advanced computing platform at Monash University Malaysia to train the neural network models.
Authors' Contributions
S.S. developed the methodology, performed the experiments, and prepared the article. Z.Y.D. and J.Y.L. developed the system. V.M.B. and C.P.T. supervised the project. S.G.N. conceived, supervised, and funded the project. All authors read and provided feedback for the article draft and approved the final article.
Data and Materials Availability
Author Disclosure Statement
No competing financial interests exist.
Funding Information
No funding was received for this article.
References