Abstract
Data-driven calibration methods have shown promising results for accurate proprioception in soft robotics. This process can benefit greatly from numerical simulation for computational efficiency. However, the gap between the simulated and real domains limits the accurate, generalized application of the approach. Herein, we propose an unsupervised domain adaptation framework as a data-efficient, generalized alignment of these heterogeneous sensor domains. A dual cross-modal autoencoder was designed to match the sensor domains at the feature level without any extensive labeling process, facilitating computationally efficient transfer to various tasks. Moreover, our framework integrates domain adaptation with anomaly detection, which endows robots with the capability for external collision detection. As a proof of concept, the methodology was applied to a well-known soft robot design, the multigait soft robot, and to two fundamental perception tasks for autonomous robot operation: high-fidelity shape estimation and collision detection. The resulting perception demonstrates the digital-twinned calibration process in both the simulated and real domains. The proposed design outperforms the existing prevalent benchmarks for both perception tasks. This unsupervised framework envisions a new approach to imparting embodied intelligence to soft robotic systems via blending simulation.
Introduction
Soft robots, composed of soft and stretchable materials, have long inspired future engineering applications toward safe, adaptive, and resilient interactions with unstructured environments and living organisms.1–5 Unlike traditional rigid robots, the inherent mechanical compliance of soft robots offers conformability and robustness to physical contact, which in turn comes at the cost of vulnerability. Therefore, the successful operation of autonomous soft robots demands delicate proprioception, which refers to the capability to intrinsically sense its own body kinematics, mainly using soft, stretchable sensors. These soft sensors adopt extensive stimuli-responsive materials, such as liquid metals, 6 conductive nanocomposites, 7 and permanent magnets 8 to perform various functionalities on demand. However, the accurate modeling of the kinematics analytically and numerically using these soft sensors is challenging owing to inconsistent manufacturing, viscoelastic hysteresis, and high nonlinearities in their dynamics. 9
Machine learning methods have shown great success in overcoming these limitations.10,11 Such data-driven approaches circumvent the explicit formulation of complicated, redundant soft robot dynamics. End-to-end mapping by embedded soft proprioceptive sensors is extensively leveraged for robot shape estimation,12–15 tactile sensing, 16 object identification, 17 and motion control. 18 However, current achievements suffer from inefficiencies in data acquisition as soft robot production varies largely depending on the manufacturing technique. In addition, the experimental process for the explicit representation of the robot shape mainly relies on optical camera measurements. Because of the use of optical markers, which require a certain gap among them, the explicit representation is confined to low-quality data, and visual occlusion occurs during large deformations. Considering these problems, sim-to-real approaches, which have been widely used in the field of generic robotics, are regarded as alternatives for optical measurements.13,19–24 (Table 1) Visual monitoring has long been the popular choice for perception in the sim-to-real approach, as it has considerable consistency with the real world. However, the persisting desire for the autonomous operation of soft proprioceptive robots has led to the demand for soft sensor simulation, which in turn suffers from the computational complexity of the robot body. Therefore, the development and maturation of effective sim-to-real technology requires a generalizable, data-efficient sim-to-real adaptation methodology for soft robot proprioception.
Comparison Between Previous Sim-to-Real Approaches in Soft Robotics and Our Method
LSTM, long short-term memory.
Herein, we propose an unsupervised domain-invariant representation learning approach as a label-free, high-performing, and generalized sim-to-real adaptation method for soft robotic perception. A dual cross-modal autoencoder (AE) enables the alignment of heterogeneous sensor domains at the latent feature level. As a proof of concept, the proposed framework was applied to a multigait soft robot, one of the most popular soft robots, equipped with liquid metal (EGaIn) soft strain sensors. The calibration process was performed for two principal perception tasks, shape estimation and collision detection, which are predominantly involved in robotic exploration (Fig. 1b). An extensive comparative analysis with state-of-the-art methods highlighted the effectiveness of the proposed method in both simulated and real configurations toward accurate digital twinning (Fig. 1a).

Concept of our sim-to-real adaptation framework for proprioceptive soft robots.
Proprioceptive Multigait Soft Robot
Soft robot design and fabrication
The multigait soft robot consists of five air chambers, which govern the bending of the four legs and the central body. We used highly extensible silicone (Ecoflex 00-50; Smooth-On, Inc.) for the chambers and polydimethylsiloxane (Sylgard 184; Dow Corning) for the base. The robot was fabricated by casting in a three-dimensional (3D)-printed mold, followed by bonding of the layers with silicone adhesives. The detailed fabrication process follows previous work. 25
We then embedded EGaIn soft strain sensors in each chamber to perform proprioception. These sensors are popular in fields related to soft robotics owing to their high repeatability, fabrication scalability, and adequate deformability. 26
The bending deformation of the robot is measured from resistance changes (

Fabrication of a proprioceptive multigait soft robot.
Simulation
To enable cost-efficient computation, we adopted an open-access reduced-order model of a multigait soft robot based on SOFA. 27 We simulated the soft sensor behavior by selecting the nodes of the modeled robot along the sensor pathway. The change in resistance at these nodes was calculated based on the sensor geometry. Following Pouillet’s law and Poisson’s ratio, as done in ref. 28, the result of the sensor model can be simplified to the following relationship that depends only on sensor length:
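One common length-only form of such a relationship can be sketched as follows. It assumes an incompressible liquid-metal channel (Poisson's ratio of 0.5), so that the channel volume A·L is conserved and Pouillet's law R = ρL/A reduces to R/R0 = (L/L0)^2. This is an illustration under that assumption and not necessarily the exact expression of model (1); the function names and the toy node coordinates are hypothetical.

```python
import numpy as np

def path_length(nodes):
    """Polyline length through an (N, 3) array of node positions ordered along the sensor."""
    nodes = np.asarray(nodes, dtype=float)
    return float(np.linalg.norm(np.diff(nodes, axis=0), axis=1).sum())

def resistance_ratio(nodes_rest, nodes_deformed):
    """R / R0 assuming constant channel volume: R = rho * L / A with A * L = const."""
    l0 = path_length(nodes_rest)
    l = path_length(nodes_deformed)
    return (l / l0) ** 2

# Toy sensor pathway: three nodes on a straight line, stretched uniformly by 10%.
rest = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
stretched = rest * 1.1
ratio = resistance_ratio(rest, stretched)
```

Because only the ratio L/L0 enters, the resistivity and the rest cross-section cancel out, which is what makes a length-only sensor model cheap to evaluate at every simulation step.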
A comparison of the simulated and physical sensor data is shown in Figure 3a and b. The correlation coefficients between each sensor and the pressure input are 0.9791 and 0.8295, respectively, indicating a strong correlation. However, the average relative error between the two sensors is 205%, demonstrating a significant domain gap despite the similarities in variation between the sensor values.
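The two quantities used in this comparison can be computed as in the following minimal sketch. The text does not state which signal serves as the reference for the relative error; the sketch assumes the real sensor is the reference, and the synthetic signals are placeholders rather than measured data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1D signals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

def mean_relative_error(sim, real, eps=1e-12):
    """Mean of |sim - real| / |real| (assumption: real sensor is the reference)."""
    sim = np.asarray(sim, dtype=float)
    real = np.asarray(real, dtype=float)
    return float(np.mean(np.abs(sim - real) / (np.abs(real) + eps)))

# Synthetic example: both sensors follow the pressure, but with different scales.
pressure = np.linspace(0.0, 1.0, 50)
real_sensor = 1.0 + pressure
sim_sensor = 3.0 * real_sensor

r_sim = pearson_r(pressure, sim_sensor)
r_real = pearson_r(pressure, real_sensor)
rel_err = mean_relative_error(sim_sensor, real_sensor)  # 2.0, i.e., 200%
```

The toy signals show how both sensors can correlate strongly with the pressure input while still differing by a large relative error, which is the situation described above.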

Comparison of the simulated and the real worlds.
System setup
Figure 4a and b presents the block diagram of the entire circuit and the system setup for the experiment. The environment for robot operation occupied a workspace of

Experimental setup and experimented motions of robots in the real and simulated domain.
Data collection
Data collection in both the simulated and real domains consisted of measurements (i.e., sensor value, pressure, kinematics, and contact force on the wall) under precisely scheduled robot locomotion. Two locomotion styles were used. In an unobstructed environment, we pressurized the robot with randomly generated pressure inputs to each of the five pneumatic chambers, updated at intervals of 0.5 s over a duration of 5 min. Additionally, a crawling motion was achieved using a sequence of seven manually designed actuation steps, each performed over 0.5 s: (i) starting from the rest state, then pressurizing (ii) the two rear legs, (iii) the central body, and (iv) the front legs, and finally depressurizing them in the same order in steps (v), (vi), and (vii). All measurements were recorded at 100 Hz and then low-pass filtered and downsampled to 10 Hz for smoothing. We obtained six observation sets in the unobstructed environment and one observation set of obstructed crawling.
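The smoothing step can be illustrated with a minimal sketch. The type of low-pass filter is not specified in the text; the sketch assumes a simple boxcar (block-average) filter followed by decimation, and the function name and synthetic recording are hypothetical.

```python
import numpy as np

def lowpass_and_downsample(x, fs_in=100, fs_out=10):
    """Average non-overlapping blocks of fs_in / fs_out samples along the time axis.

    Block averaging acts as a boxcar low-pass filter combined with decimation
    (an assumed stand-in for the unspecified filter).
    """
    x = np.asarray(x, dtype=float)
    if fs_in % fs_out != 0:
        raise ValueError("fs_in must be an integer multiple of fs_out")
    factor = fs_in // fs_out
    n = (x.shape[0] // factor) * factor  # drop an incomplete trailing block
    blocks = x[:n].reshape(n // factor, factor, *x.shape[1:])
    return blocks.mean(axis=1)

# A 5 min recording of five sensor channels at 100 Hz (synthetic placeholder data).
t = np.arange(5 * 60 * 100) / 100.0
raw = np.stack([np.sin(0.5 * t + k) for k in range(5)], axis=1)
smoothed = lowpass_and_downsample(raw)
```

With these settings, a 5 min observation of 30,000 samples per channel reduces to 3000 time steps per channel.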
The time step of the simulation was set to 0.02 s to fit the real-world measurement period. For the unobstructed case, we pressurized the robot using the same pressure measured in the real world, corresponding to the aforementioned six observations. For the obstructed environment, we collected 11 observations in the simulation, following the same drive protocol as in the real world. To achieve realistic system control during simulation, the simulated pressure trajectory was made to follow that of the real world, as shown in Figure 3b. Specifically, we bounded the rate of pressure change at each simulation step and gradually increased the pressure input until it reached the designated target value. As briefly described in the “Simulation” section, the sensor value was derived from the aforementioned model (1). The contact of the robot body with the wall was monitored by computing the reaction force. Last, we gathered the 3D positions of 123 nodes to train the kinematic estimation.
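The rate-bounded pressure input can be sketched as follows. The bound on the per-step pressure change, the pressure units, and the target values are hypothetical, as the text does not report them.

```python
import numpy as np

def ramp_pressure(current, target, max_step):
    """Move each chamber pressure toward its target by at most max_step per step."""
    delta = np.clip(np.asarray(target, dtype=float) - current, -max_step, max_step)
    return current + delta

def simulate_ramp(start, target, max_step, n_steps):
    """Return the (n_steps, n_chambers) pressure trajectory under the rate bound."""
    p = np.asarray(start, dtype=float)
    trajectory = []
    for _ in range(n_steps):
        p = ramp_pressure(p, target, max_step)
        trajectory.append(p)
    return np.array(trajectory)

# One 0.5 s actuation step at a 0.02 s simulation step gives 25 simulation steps.
start = np.zeros(5)
target = np.array([10.0, 0.0, 5.0, 0.0, 2.0])  # hypothetical target pressures
traj = simulate_ramp(start, target, max_step=1.0, n_steps=25)
```

Chambers with small targets settle within a few steps, whereas chambers with large targets rise linearly until they reach the target, which mimics the finite response of a physical pressure regulator.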
Sim-to-Real Adaptation with a Dual AE
Proposed network
Taking inspiration from the principles of domain adaptation, we aligned domains by matching them within the latent space. Intuitively, a shared feature space exists between these domains, aside from the unmodeled dynamics. Therefore, the domains can be bridged by creating a domain-invariant feature representation.
The core of our methodology is a long short-term memory (LSTM)-based dual AE architecture. This dual AE is designed to fulfill two primary objectives: reconstruction-based domain adaptation and anomaly detection. This strategic design ensures that the training for feature extraction intrinsically supports collision detection. We opted for a dual AE over a single AE because the significant gap between the domains necessitates separate encoders, and accurate reconstruction of each domain is crucial for precise anomaly detection. Building on this foundation, we utilized the extracted features to further train the kinematic estimation model.
The reconstruction task is integral to the AE’s functionality, facilitating the extraction of meaningful latent feature representations. For domain adaptation, the reconstruction task ensures that the latent space captures essential information from the sensor data, thereby enabling effective domain alignment. Conversely, in the context of anomaly detection, the latent features should perform poorly in unseen conditions to detect anomalies effectively. To address these conflicting requirements, we input both the latent features and pressure data into the decoder. This approach allows the decoder to focus more on its role in anomaly detection, ensuring that it does not reconstruct well in unseen conditions. To ensure that the latent features capture meaningful information from the sensors, we train them alongside the shape estimation model. This joint training ensures that the latent features are useful for both shape estimation and collision detection tasks.
The comprehensive structure of our network, including the dual AE, a kinematic estimation model, and a collision detection mechanism, is shown in Figure 5a. In our method, the dual AE is used to extract the shared latent space from each of the two sensor domains

Our proposed network of domain adaptation with multiple tasks by using a dual cross-modal autoencoder (AE).
In the first phase, the dual AE is trained for feature extraction. Each encoder
In the second phase, task-specific calibration was performed in the shared latent space. In our two tasks, kinematic estimation was learned through a neural network architecture—this learning model is referred to as the kinematic estimation model, which predicts the outermost shape of the robot
We used various baseline methods to validate the beneficial features in our framework and the performance of domain adaptation and anomaly detection. For domain adaptation, supervised learning based on the LSTM (vanilla LSTM), convolutional deep domain adaptation model for time-series data (CoDATS), 29 recurrent DANN (R-DANN), domain separation network (DSN), 30 and weight-shared AE (single AE) approaches are utilized. In the case of anomaly detection, we adopt prediction-based 31 (prediction-based real-to-sim vanilla LSTM, prediction-based dual AE) and reconstruction-based methods (reconstruction-based DSN 32 ). For more details on each baseline method and the network structures, see Supplementary Section S4.
Training details
For the training process, we utilized five observations of random actions in both the simulated and real worlds and ten observations of obstructed crawling in the simulated setting. From these observations, one sample of each observation set was used for validation, while others were used for training. To ensure stable and efficient training, all training data were normalized. The models were trained with the Adam optimizer 33 with an initial learning rate of
Implementation of collision detection
We conducted ground truth collision labeling based on the contact force, classifying events as collisions only in instances of strong impacts. This approach is grounded in the observation that the deformation of the robot becomes abnormal (i.e., less than usual) exclusively during strong collisions. In line with these characteristics, we used two approaches for error calculation based on the model type. For reconstruction-based methods, we calculated errors only when the reconstructed sensor data exceeded the input sensor data. In contrast, for prediction-based methods, we used the absolute value of the error to identify collisions.
To establish the threshold for collision detection, we first summed the errors observed across all five sensor channels into a single time series sequence. We then calculated the average of the 10th highest error values from each type of motion (namely, random action and obstructed crawling). The final threshold was determined by obtaining the mean value from five independently trained models, thereby ensuring a more generalized result.
Given that pressure was applied at intervals of 0.5 s, we segmented the data into sets of five time steps. Collision labeling was then implemented based on the number of data points that exceeded the predetermined threshold within each of these segments. In the simulation environment, we set the threshold number to 1 and adjusted it to 2 in the real-world experimental setup to account for potential noise.
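The detection procedure described in this section can be summarized in a minimal sketch. Several details are assumptions: a segment is labeled as a collision when the number of points above the threshold is greater than or equal to the threshold number, the error series are synthetic, and all function names are hypothetical. Averaging the threshold over five independently trained models is omitted for brevity.

```python
import numpy as np

def summed_error(inp, recon, one_sided=True):
    """Per-time-step error summed over sensor channels for (T, C) arrays.

    one_sided=True keeps only cases where the reconstruction exceeds the input
    (reconstruction-based methods); False uses the absolute error
    (prediction-based methods).
    """
    diff = np.asarray(recon, dtype=float) - np.asarray(inp, dtype=float)
    err = np.maximum(diff, 0.0) if one_sided else np.abs(diff)
    return err.sum(axis=1)

def threshold_from_motions(error_series_per_motion, k=10):
    """Average of the k-th highest error taken from each type of motion."""
    kth = [np.sort(np.asarray(e, dtype=float))[-k] for e in error_series_per_motion]
    return float(np.mean(kth))

def label_segments(errors, threshold, seg_len=5, min_count=1):
    """Label each seg_len-step segment as a collision if enough points exceed the threshold."""
    errors = np.asarray(errors, dtype=float)
    n = (len(errors) // seg_len) * seg_len
    counts = (errors[:n].reshape(-1, seg_len) > threshold).sum(axis=1)
    return counts >= min_count  # assumption: "threshold number" means at least this many

# Synthetic error series for the two motion types.
random_action_err = np.arange(100.0) / 100.0
crawling_err = np.arange(50.0) / 100.0
thr = threshold_from_motions([random_action_err, crawling_err])

test_err = np.zeros(20)
test_err[[7, 12, 13]] = 1.0
sim_labels = label_segments(test_err, thr, min_count=1)   # simulation setting
real_labels = label_segments(test_err, thr, min_count=2)  # real-world setting
```

Raising the threshold number from 1 to 2 suppresses segments in which a single noisy sample crosses the threshold, at the cost of missing collisions that produce only one large error within a segment.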
Results
We first assessed the adaptability across sensor domains through kinematic estimation, providing a comparative analysis with baseline methods. Next, we demonstrated the effectiveness of our approach in collision detection through anomaly detection experiments. To ensure the reliability and applicability of our results, we averaged them across five distinct trained models. The performance was evaluated by comparison with the aforementioned baseline methods and with a specific variant of our method—a dual AE without pressure for the decoder, denoted as dual AE (w/out P). This comparison allowed us to investigate the effects of integrating pressure data into the decoder. We concluded our analysis by explicitly illustrating the reduction of the domain gap by comparing the latent vectors extracted from each sensor domain.
Shape estimation results
We evaluated the performance of domain adaptation in each domain for two scenarios: random action and obstructed crawling. Table 2 summarizes the results of shape estimation based on physical and simulated sensors. To assess the accuracy of shape estimation, we computed the mean absolute error for the 3D position of 123 nodes over 3000 time steps.
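The error metric can be sketched as follows. The text does not specify whether the absolute error is taken per coordinate or as the Euclidean distance per node, so both variants are shown as assumptions; the arrays are synthetic placeholders with the shape described above.

```python
import numpy as np

def shape_mae_per_coordinate(pred, true):
    """Mean absolute error over all time steps, nodes, and x/y/z coordinates."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return float(np.mean(np.abs(pred - true)))

def shape_mean_node_distance(pred, true):
    """Mean Euclidean distance between predicted and true node positions."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return float(np.mean(np.linalg.norm(pred - true, axis=-1)))

# 3000 time steps, 123 nodes, 3D positions; every node is offset by (3, 4, 0).
true_shape = np.zeros((3000, 123, 3))
pred_shape = true_shape + np.array([3.0, 4.0, 0.0])

mae = shape_mae_per_coordinate(pred_shape, true_shape)
dist = shape_mean_node_distance(pred_shape, true_shape)
```

The two variants differ by a geometry-dependent factor (7/3 versus 5 in this toy case), so errors are comparable across methods only when the same variant is used throughout.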
Shape Estimation Results of the Simulated and Physical Sensors
AE, autoencoder; CoDATS, convolutional deep domain adaptation model for time-series data; DSN, domain separation network; R-DANN, recurrent DANN.
Simulation domain
In the simulation domain, all adaptation methods successfully estimated shapes during random actions. However, for obstructed crawling, the adversarially trained models (CoDATS and R-DANN) and DSN had larger errors. This result implies that the features extracted by these models fail to represent sensor dynamics. Figure 6a supports these performance comparisons in detail.

Comparison of kinematics estimation
Real domain
While all methods effectively estimated shapes for random actions using simulated sensor data, differences emerged when using physical sensor data. From Figure 6b and Table 2, we can see that vanilla LSTM struggles to accurately estimate kinematics without real-to-sim adaptation, highlighting the necessity for domain adaptation. Furthermore, the performances of adversarially trained models are limited, and R-DANN outperforms CoDATS in both domains. This result indicates that, compared to CNN layers, LSTM layers are better suited for soft robots that have high hysteresis and require long-term memory. Interestingly, DSN produces consistent errors in both simulated and real domains, implying that its kinematic model relies primarily on pressure data, as also shown in Figure 6a. The errors in our methods closely align with those observed in the simulation domain and the real-to-sim vanilla LSTM, confirming their adaptability.
In the case of obstructed crawling, generating equivalent simulated shape labels for both simulation and real-world scenarios was not feasible. Instead, we used marker positions obtained via motion tracking to calculate the error between the estimated and physical shapes. Specifically, we selected five markers located in the middle of each chamber where the largest deformations occurred. The height difference between these markers and their corresponding nodes in the finite element model was then computed. The results in Table 3 indicate that the error trends obtained using physical shape labels closely mirror those from simulated shape labels in random action scenarios. However, when it comes to obstructed crawling scenarios, the errors increase. This result implies that although shape estimation is feasible, model accuracy is lower in real-world obstructed crawling scenarios than in the simulations, as these conditions are unseen to the kinematics model. Notably, the dual AE method consistently showed the lowest error rates, while the adversarial methods performed less effectively, similar to their performance under simulated conditions.
Error Between the Shapes Estimated from the Real Sensor and the Marker Positions of the Physical Robot [mm]
Figure 7 offers a visual comparison of the estimated shapes with their errors for obstructed crawling in the simulation domain and random action in the real domain. Well-performing baseline methods, that is, DSN and real-to-sim vanilla LSTM, are shown along with our proposed model for a comprehensive comparison. A detailed comparison of the marker data and estimated shape in the physical robot across all methods is available in Supplementary Video S1.

Comparison of the estimated kinematics results with the error for each adaptation method. The upper rows show the results from the simulation domain for obstructed crawling, whereas the lower rows show the results from the real domain for random action.
Collision detection results
Figure 8 displays the sensor data and its errors together with the detection results of our method. The figure shows that an obstruction reduces the input sensor signal. Figure 9a presents the confusion matrix used for evaluating the performance of our method. We recorded true positives (TPs) when anomalies were detected either in the current step or in the following step, which reflects the inherent time delays in soft robot reactions. For a more detailed analysis, we separately recorded TPs for the current and next steps. Anomaly detection results for obstructed crawling in both the real and simulated domains are visualized in Figure 9b and c. Based on these criteria, we evaluated the results using the F1 score (F1) and accuracy (A), which are defined as follows:

Collision detection results of our proposed model (dual AE) with the comparison of the input sensor and reconstructed sensor signals change over time, and the robot configuration at the time spots in

Collision detection results.
A higher accuracy and F1 score indicate correct data point classification. While accuracy provides an overall measure of detection correctness, the F1 score is more appropriate for imbalanced datasets. In general, the F1 score is commonly used as an evaluation metric of anomaly detection, but in random action scenarios where TP values are not available, only accuracy is used to evaluate performance. For obstructed crawling scenarios, we calculated both metrics to provide a comprehensive evaluation of the model’s performance. The results are summarized in Table 4.
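Both metrics follow their standard definitions, F1 = 2TP / (2TP + FP + FN) and A = (TP + TN) / (TP + TN + FP + FN). The following minimal sketch computes them with the current-or-next-step rule for TPs. How a detection that has already been credited as a next-step TP is counted at its own step is not stated; the sketch assumes it is not counted as a false positive. The label sequences are synthetic.

```python
import numpy as np

def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN; a collision counts as detected at step t or t + 1."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    n = len(truth)
    used = np.zeros(n, dtype=bool)  # detections credited to the preceding step
    tp = fn = fp = tn = 0
    for t in range(n):
        if truth[t]:
            if pred[t]:
                tp += 1
            elif t + 1 < n and pred[t + 1]:
                tp += 1
                used[t + 1] = True
            else:
                fn += 1
    for t in range(n):
        if not truth[t]:
            if pred[t] and not used[t]:
                fp += 1
            else:
                tn += 1  # assumption: a credited delayed detection is not an FP
    return tp, fp, fn, tn

def f1_and_accuracy(tp, fp, fn, tn):
    """Standard F1 score and accuracy; F1 is NaN when it is undefined."""
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else float("nan")
    acc = (tp + tn) / (tp + fp + fn + tn)
    return f1, acc

truth = [0, 1, 0, 0, 1, 0, 0, 0]
pred = [0, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(pred, truth)
f1, acc = f1_and_accuracy(*counts)
```

When no collision and no detection occur, as can happen in random action scenarios, the F1 score is undefined, so the sketch returns NaN and only accuracy is informative.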
Accuracy and F1 Scores of the Collision Detection
Bolded values represent the highest value for each evaluation metric.
Simulation domain
In random action scenarios, all methods except for the prediction-based ones performed well. However, in obstructed crawling, only the reconstruction-based dual AE achieved an accuracy and F1 score of 1, while single AE and dual AE (w/out P) showed good results in shape estimation. Detailed results are shown in Figure 9b, where FNs are prevalent in other methods. This is because the decoders in the other methods can reconstruct even abnormal sensor data similarly, resulting in small reconstruction errors.
Real domain
While prediction-based methods achieve the highest accuracy, Table 4 shows their limitations in collision detection, which lead to low F1 scores. Notably, DSN and dual AE stand out with the highest F1 scores. However, it is important to note that all methods, including these two, exhibit high FP rates in real-world scenarios, which diverges from their performance in simulations and leads to overall low accuracy. This deviation can be attributed to the inherent dynamics of the physical robot, such as less-pronounced initial deformations and delayed sensor responses to obstructions. These factors contribute to large reconstruction errors, which in turn lead to mislabeling events as collisions. In contrast to its simulation performance, dual AE shows elevated FN values because of the denoising process applied to real sensor data during latent matching. Therefore, smaller sensor variations are often smoothed out in the reconstructed data and remain undetected, leading to an increase in FN values.
Among all scenarios in both domains, it is evident that dual AE consistently shows the highest average of accuracy and F1 score. These results demonstrate the effectiveness of the reconstruction-based dual AE for detecting collisions in both simulated and real sensor domains compared to the other anomaly detection methods. Moreover, although dual AE (w/out P) showed slightly better results (by 2.57%) in shape estimation, it lagged significantly in collision detection, with a performance disparity of 10.76%. This differential highlights the advantages of integrating pressure into the decoder when performing domain adaptation and collision detection tasks.
Generalization of the model
To demonstrate the generalization of our method, we conducted additional experiments across different scenarios with varying actuation time steps and environmental setups. For detailed results, see Supplementary Section S6.
Model analysis
To identify the reduction of the gap between domains using each method, we used the t-distributed stochastic neighbor embedding (t-SNE) 35 to compare the extracted features from each domain. By projecting them from a high-dimensional space to a 2D plane using t-SNE, we can visualize and compare the characteristics of these two feature vectors. In this 2D plane, the distance between any two points indicates the similarity of the features that they represent.
As shown in Figure 10, the feature clusters from both the simulated and real domains, which represent the sensor data before domain adaptation, share a similar region but with a noticeable gap. Additionally, the flow of features within the feature vector appears disorganized, and this is likely a consequence of the nonlinear attributes of the soft sensors. In contrast, after domain adaptation, the extracted features appear linear, which is characteristic of time-series data, suggesting that the adaptation methods successfully captured the time-series property of the data. However, notably, DSN and R-DANN methods yield more distinct feature clusters. In comparison, our dual AE model shows regions where the simulated and real domains nearly overlap, indicating the proficient adaptation of the sensors in the two domains. Moreover, the distribution exhibits a more pronounced differentiation than before, underscoring the capability of our model to not only bridge the domain gap but also refine the sensor data for enhanced distinction.

Data distributions of sensor data and extracted features from the simulated and real domains.
Conclusion
In this work, we introduced an unsupervised domain adaptation methodology for sim-to-real bridging of soft robot perception using a dual cross-modal AE. The sensor dynamics in these heterogeneous domains are matched at the latent level, eliminating the domain-specific properties of each domain. Through extensive investigations, we demonstrated the effectiveness of our method compared with previously developed methods in multiple tasks that are crucial and challenging in autonomous soft robot operation. Our results show that our framework not only performs comparably to supervised learning in domain adaptation but even outperforms it, especially under unseen real-world conditions such as obstructed crawling. This result emphasizes the robustness and generalizability of our latent matching approach.
As mentioned in the “Introduction” section, machine learning provides a means to address the complexity of the modeling of soft robots. Although the approach is of interest in the field of generic robotics, the unique characteristics of soft robots strongly require such data-driven computation, rather than analytical and numerical formulations. For instance, under ideal conditions, highly accurate soft body simulation can achieve a computationally efficient calibration process without any domain adaptation process. However, there are challenges posed by variance in the manufacturing process and high complexity (or, often, unavailability) in soft continuum mechanics that exhibit nonlinear and contact-rich characteristics. Our simulation achieves adequate computation performance, indicating that our approach is generalizable to various soft robot designs that involve comprehensive actuation mechanisms, geometry, and perception methods and is applicable to many other perception tasks such as terrain classification and environmental recordings.
Although we demonstrated a methodology for the sim-to-real transfer of sensors via the proprioceptive multigait soft robot, some limitations remain. First, during the latent matching process, our framework denoises real-world sensor data. Therefore, any abnormal sensor changes are less reflected in the reconstructed data, leading to less distinguishable reconstruction errors compared to those observed in simulations. To alleviate this problem, we summed the errors across all five channels for collision detection, albeit with low sensing resolution and sensitivity. Future research on reducing sensor noise during the fabrication or measurement stages could improve collision detection accuracy and make segment-specific detection possible. In addition, our experimental setup primarily focused on a crawling gait pattern with a single obstruction, that is, a wall. This framework can be extended to various control tasks and obstructions. By incorporating higher-resolution sensors, such as those for the whole-body sensing approach, 36 we can access a richer data set. An enriched data pool can facilitate more precise classification and recognition for various perceptual tasks, including identifying the point of contact of the robot with obstacles or classifying different types of terrains and obstructions. Last, the robot used in our study has a simple structure with five PneuNets bending in specific directions. This simplicity allowed us to train our model using a relatively small real-world dataset (five observations of 5 min each). However, for more complex robotic structures, a larger real-world dataset would be necessary for effective model training. In such cases, incorporating the generative modeling approach 37 for data augmentation could significantly enhance the efficiency of data acquisition.
The resulting digital-twinned perception can serve as a substantial basis for learning high-level soft robot control, such as reinforcement learning. Training models in simulations across diverse configurations can enable the development of a versatile pipeline that mitigates extensive experimentation in the real world. Our framework can facilitate computationally efficient sim-to-real transfer of the learned control strategy.
Footnotes
Acknowledgment
This article has been previously published as a preprint on arXiv with the DOI 10.48550/arXiv.2310.14075.
Author Disclosure Statement
This work was supported by Korea Evaluation Institute of Industrial Technology (KEIT) grant funded by the Korea government (MNOTIE) (No.141518481).
Funding Information
This work was supported by the Korea Evaluation Institute of Industrial Technology (KEIT) grant funded by the Korean government (MNOTIE) (No. 1415184816).
References
