Unsupervised Sim-to-Real Adaptation of Soft Robot Proprioception Using a Dual Cross-Modal Autoencoder

Abstract

Data-driven calibration methods have shown promising results for accurate proprioception in soft robotics. This process can benefit greatly from numerical simulation, which improves computational efficiency. However, the gap between the simulated and real domains limits the accurate, generalized application of the approach. Herein, we propose an unsupervised domain adaptation framework for the data-efficient, generalized alignment of these heterogeneous sensor domains. A dual cross-modal autoencoder was designed to match the sensor domains at the feature level without any extensive labeling process, facilitating computationally efficient transfer to various tasks. Moreover, our framework integrates domain adaptation with anomaly detection, which endows robots with the capability for external collision detection. As a proof of concept, the methodology was applied to a well-known soft robot design, the multigait soft robot, and to two perception tasks fundamental to autonomous robot operation: high-fidelity shape estimation and collision detection. The resulting perception demonstrates the digital-twinned calibration process in both the simulated and real domains. The proposed design outperforms prevalent existing benchmarks for both perception tasks. This unsupervised framework offers a new approach to imparting embodied intelligence to soft robotic systems by blending in simulation.

Introduction

Soft robots, composed of soft and stretchable materials, have long inspired future engineering applications toward safe, adaptive, and resilient interactions with unstructured environments and living organisms.^1–5 Unlike traditional rigid robots, the inherent mechanical compliance of soft robots offers conformability and robustness to physical contact, which in turn comes at the cost of vulnerability. Therefore, the successful operation of autonomous soft robots demands delicate proprioception, which refers to the capability to intrinsically sense its own body kinematics, mainly using soft, stretchable sensors. These soft sensors adopt extensive stimuli-responsive materials, such as liquid metals,⁶ conductive nanocomposites,⁷ and permanent magnets⁸ to perform various functionalities on demand. However, the accurate modeling of the kinematics analytically and numerically using these soft sensors is challenging owing to inconsistent manufacturing, viscoelastic hysteresis, and high nonlinearities in their dynamics.⁹

Machine learning methods have shown great success in overcoming these limitations.^10,11 Such data-driven approaches circumvent the explicit formulation of complicated, redundant soft robot dynamics. End-to-end mapping by embedded soft proprioceptive sensors is extensively leveraged for robot shape estimation,^12–15 tactile sensing,¹⁶ object identification,¹⁷ and motion control.¹⁸ However, current achievements suffer from inefficiencies in data acquisition as soft robot production varies largely depending on the manufacturing technique. In addition, the experimental process for the explicit representation of the robot shape mainly relies on optical camera measurements. Because optical markers require a certain gap among them, the explicit representation is confined to low-quality data, and visual occlusion occurs during large deformations. Considering these problems, sim-to-real approaches, which have been widely used in the field of generic robotics, are regarded as alternatives to optical measurements (Table 1).^13,19–24 Visual monitoring has long been the popular choice for perception in the sim-to-real approach, as it has considerable consistency with the real world. However, the persisting desire for the autonomous operation of soft proprioceptive robots has led to the demand for soft sensor simulation, which in turn suffers from the computational complexity of the robot body. Therefore, the development and maturation of effective sim-to-real technology requires a generalizable, data-efficient sim-to-real adaptation methodology for soft robot proprioception.

Table 1.

Comparison Between Previous Sim-to-Real Approaches in Soft Robotics and Our Method

Study	Description	Task	Sensor	Hardware	Fixed end	Simulator/engine	Learning model
Du et al.²¹	Leverage gradients to improve the dynamic model and open-loop control signals via trajectory optimization	Control sequence optimization	Camera	Soft starfish	X	DiffPD³⁴	Optimization
Truby et al.¹³	Predict kinematic parameters based on experiments	Configuration estimation	Kirigami sensors	Soft arm consisting of three segments	O	Matlab (Kinematic model)	MLP, LSTM
Yoo et al.²³	Zero-shot sim-to-real transfer; needs a fixed end for installing a camera inside the robot	Shape reconstruction (3174 points)	Camera	Omni-directional pneunet	O	SOFA (FEM)	Encoder–decoder
Gao et al.²⁴	Learning modeling error between simulated and physical systems	Residual force estimation	Camera	Silicone beam and soft pneumatic arm	O	DiffPD	Residual physics network with MLP
This work	Applicable to any type of soft robot by modifying sensor routing and numbers; requires datasets from the real world	Shape reconstruction (123 points); collision detection	Soft strain sensor	Multigait soft robot	X	SOFA (FEM)	Dual autoencoder

LSTM, long short-term memory.

Herein, we propose an unsupervised domain-invariant representation learning approach as a label-free, high-performing, and generalized sim-to-real adaptation method for soft robotic perception. A dual cross-modal autoencoder (AE) enables the alignment of heterogeneous sensor domains at the latent feature level. As a proof of concept, the framework was applied to a multigait soft robot, one of the most popular soft robots, equipped with liquid metal (EGaIn) soft strain sensors. The calibration process was performed for two principal perception tasks, shape estimation and collision detection, which are predominantly involved in robotic exploration (Fig. 1b). An extensive comparative analysis with state-of-the-art methods highlighted the effectiveness of the proposed method in both simulated and real configurations toward accurate digital twinning (Fig. 1a).

FIG. 1.

Concept of our sim-to-real adaptation framework for proprioceptive soft robots. (a) Conceptual illustration of the soft robot with sim-to-real learning-based multimodal control. (b) The proposed network of domain adaptation with shape estimation and collision detection in the simulated and real domains. Simulation provides a high-fidelity mesh representation for shape estimation, while the real world relies on marker positions. In collision detection, simulation allows for generating labeled datasets more easily compared to the real world.

Proprioceptive Multigait Soft Robot

Soft robot design and fabrication

The multigait soft robot consists of five air chambers, which govern the bending of the four legs and the central body. We used highly extensible silicone (Ecoflex 00-50; Smooth-On, Inc.) for the chambers and polydimethylsiloxane (Sylgard 184, Dow Corning) for the base. The robot was fabricated by casting in a three-dimensional (3D)-printed mold, followed by bonding of the layers with silicone adhesives. The detailed fabrication process follows previous work.²⁵

We then embedded EGaIn soft strain sensors in each chamber to perform proprioception. These sensors are popular in fields related to soft robotics owing to their high repeatability, fabrication scalability, and adequate deformability.²⁶ The bending deformation of the robot is measured from resistance changes in the liquid metal pathway, $Δ R \propto Δ l / Δ A$, where l is the pathway length and A is the cross-sectional area. To prevent electrical disconnection in the liquid metal pathway because of excessive chamber inflation, the EGaIn material is patterned along the edge of the air chamber, as shown in Figure 2a. The fabrication of these sensors first involves printing the EGaIn material on the inflating layer using a stencil mask. An electrical connection is then established at the end of the EGaIn pathway by attaching metallic threads, which offer a reliable connection owing to their high mechanical affinity to liquid metals. The interconnection between electrical wires (UL-AWG24) and the metallic threads is mediated by small electrical boards located on the sides of the robot body. These boards prevent thread detachment from the liquid metal by isolating the wire tension, which is mostly caused by wire pulling. After the wires and threads are connected, the wires are covered with a thin layer of Ecoflex 00-50. This step concludes the fabrication process. The fabrication process and the fabricated soft robot are illustrated in Figure 2.

FIG. 2.

Fabrication of a proprioceptive multigait soft robot. (a) Craft flow diagram of the strain sensors on multigait soft robot. (b) Top view of the fabricated soft robot and its dimensions.

Simulation

To enable cost-efficient computation, we adopted an open-access reduced-order model of a multigait soft robot based on SOFA.²⁷ We simulated the soft sensor behavior by selecting the nodes of the modeled robot along the sensor pathway. The change in resistance at these nodes was calculated based on the sensor geometry. Following Pouillet’s law and Poisson’s ratio, as done previously,²⁸ the sensor model can be simplified to the following relationship, which depends only on the sensor length:

$Δ R = R_{0} \left( \left( 1 + \frac{Δ l}{l_{0}} \right)^{2} - 1 \right)$ (1)

where l and R are the sensor length and resistance, respectively, Δ denotes the change from the value at rest, and subscript ${(\cdot)}_{0}$ indicates the value at rest.
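As an illustration of Equation (1), the following minimal sketch computes the pathway length from a list of node positions and evaluates the resistance relationship. Because the right-hand side of Equation (1) vanishes at rest, the sketch treats its output as the change in resistance relative to $R_{0}$. The function names and the node coordinates are hypothetical and are not taken from the SOFA model.

```python
import math


def path_length(nodes):
    """Sum of straight-line distances between consecutive 3D nodes on the sensor pathway."""
    return sum(math.dist(a, b) for a, b in zip(nodes, nodes[1:]))


def resistance_change(r0, l0, length):
    """Equation (1): R0 * ((1 + dl / l0)**2 - 1), with dl = length - l0."""
    if l0 <= 0:
        raise ValueError("rest length l0 must be positive")
    strain = (length - l0) / l0
    return r0 * ((1.0 + strain) ** 2 - 1.0)


# Hypothetical node coordinates (arbitrary units), not taken from the robot model.
rest_nodes = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
bent_nodes = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (2.2, 0.0, 0.0)]

l0 = path_length(rest_nodes)
l = path_length(bent_nodes)
delta_r = resistance_change(r0=2.0, l0=l0, length=l)  # 10% stretch gives about 0.21 * R0
```

A 10% elongation therefore yields a resistance change of roughly 21% of the rest value, which reflects the quadratic dependence on stretch.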

A comparison of the simulated and physical sensor data is shown in Figure 3a and b. The correlation coefficients between the sensor signal and the pressure input are 0.9791 and 0.8295 for the simulated and physical sensors, respectively, indicating a strong correlation in both domains. However, the average relative error between the two sensor signals is 205%, demonstrating a significant domain gap despite the similar trends in the sensor values.
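The coexistence of a high correlation with a large relative error can be illustrated with a small numerical sketch. The definition of the average relative error used here (mean of |sim − real| / |real|) is an assumption, as are the function names and the toy signals; the values are not the measured data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_relative_error(sim, real, eps=1e-9):
    """Assumed definition: average of |sim - real| / |real| over all samples."""
    return sum(abs(s - r) / max(abs(r), eps) for s, r in zip(sim, real)) / len(sim)


# Toy signals: both follow the pressure, but the simulated one is three times larger.
pressure = [10.0, 20.0, 30.0, 40.0]
real_signal = [1.0, 2.0, 3.0, 4.0]
sim_signal = [3.0, 6.0, 9.0, 12.0]

corr_real = pearson(real_signal, pressure)
corr_sim = pearson(sim_signal, pressure)
gap = mean_relative_error(sim_signal, real_signal)  # 2.0, i.e. 200%
```

Both toy signals are perfectly correlated with the pressure, yet their relative error is 200%, which is the kind of scale mismatch that correlation alone does not reveal.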

FIG. 3.

Comparison of the simulated and the real worlds. (a) Rear-right leg sensor signal and its configuration, (b) the hysteresis loop between the normalized sensor signal and pressure, and (c) pressure input. In the robot model, red and green indicate the presence and absence of applied pressure, respectively.

System setup

Figure 4a and b present the block diagram of the entire circuit and the system setup for the experiment. The environment for robot operation occupied a workspace of $1.1 \times 1.1\ \mathrm{m}^{2}$ to ensure sufficiently free robot locomotion. A wall, which served as an obstruction, was installed at the center (Fig. 4c), with space left at the bottom to allow the robot to pass. The robot was driven by the controlled pressurization of its pneumatic channels using five proportional pressure regulators (VPPM, Festo) commanded by a microcontroller unit (myRIO-1900, National Instruments). The resulting pressure value was simultaneously measured by the same devices. We set the maximum pressure to 30 and 35 kPa for the legs and body, respectively, to avoid exceeding the strain limit of the sensors. The sensor resistance value was measured by a voltage-dividing circuit with reference resistance R. The robot kinematics was measured by a motion capture system with four OptiTrack cameras (OptiTrack Prime 41, NaturalPoint Inc.). In all, 13 markers were placed on the robot (12 on its legs and one on the body), as shown in Figure 2b. The contact of the robot with the wall was measured by a sensitive load cell (SEN-14727, SparkFun Electronics) placed at the top of the wall.

FIG. 4.

Experimental setup and experimented motions of robots in the real and simulated domain. (a) Circuit block diagram of the control and sensing in the proposed proprioceptive robot with collision detection. (b) System setup with measurements, actuators, and robot. (c) Shape change for each motion over time of simulated robots and real-world robots.

Data collection

Data in both the simulated and real domains were collected by measuring the sensor values, pressure, kinematics, and contact force on the wall under precisely scheduled robot locomotion. Two locomotion styles were embedded in the soft robot operation. In an unobstructed environment, we pressurized the robot with randomly generated pressure inputs to each of the five pneumatic chambers, updated at intervals of 0.5 s over a duration of 5 min. Additionally, crawling motion was achieved using a sequence of seven manually designed actuation steps, each performed over 0.5 s: (i) starting from the rest state, then pressurizing the (ii) two rear legs, (iii) central body, and (iv) front legs, and finally depressurizing them in the same order (v–vii). All measurements were acquired at 100 Hz and then low-pass filtered and downsampled to 10 Hz for smoothing. We obtained six observation sets in the unobstructed environment and one observation in obstructed crawling.
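The reduction from 100 Hz to 10 Hz can be sketched as follows. The filter type is not specified in the text, so this example assumes a simple boxcar (block-average) low-pass filter followed by decimation; the function name is hypothetical.

```python
def lowpass_downsample(samples, factor=10):
    """Average each non-overlapping block of `factor` samples.

    Block averaging acts as a boxcar low-pass filter combined with
    decimation; a trailing incomplete block is discarded.
    """
    n_blocks = len(samples) // factor
    return [
        sum(samples[i * factor:(i + 1) * factor]) / factor
        for i in range(n_blocks)
    ]


# One second of a hypothetical 100 Hz signal becomes ten samples at 10 Hz.
raw = [float(i) for i in range(100)]
smooth = lowpass_downsample(raw, factor=10)
```

At 10 Hz, each 0.5 s actuation interval corresponds to five samples, which matches the five-time-step segments used later for collision labeling.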

The time step of the simulation was set to 0.02 s to fit the real-world period of measurement. For the unobstructed case, we pressurized the robot using the same pressure measured in the real world, corresponding to the aforementioned six observations. In the case of an obstructed environment, we collected 11 observations in the simulation, following the same drive protocol as in the real world. To achieve realistic system control during simulation, the pressure trajectory was emulated to match that in the real world, as shown in Figure 3c. Specifically, we bounded the rate of the pressure change for each simulation step and gradually increased the pressure input until it reached the designated target value. As briefly described in the “Simulation” section, the sensor value was derived from Equation (1). The contact of the robot body with the wall was monitored by computing the reaction force. Last, we gathered the 3D positions of 123 nodes to train the kinematic estimation.
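The bounded-rate pressure update described above can be sketched as a simple rate limiter. The step limit of 4 kPa per simulation step and the function names are hypothetical; the text does not report the actual bound. The sketch also handles depressurization symmetrically, which is an assumption.

```python
def ramp_pressure(current, target, max_step):
    """Move `current` toward `target` by at most `max_step` per simulation step."""
    delta = target - current
    if abs(delta) <= max_step:
        return target
    return current + max_step if delta > 0 else current - max_step


def simulate_ramp(start, target, max_step, n_steps):
    """Return the pressure applied at each of `n_steps` simulation steps."""
    trajectory = []
    pressure = start
    for _ in range(n_steps):
        pressure = ramp_pressure(pressure, target, max_step)
        trajectory.append(pressure)
    return trajectory


# Hypothetical example: ramp a leg chamber from 0 to 30 kPa in steps of at most 4 kPa.
leg_trajectory = simulate_ramp(start=0.0, target=30.0, max_step=4.0, n_steps=10)
```

The pressure rises linearly until the remaining difference is smaller than the step limit, after which it holds at the target value.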

Sim-to-Real Adaptation with a Dual AE

Proposed network

Taking inspiration from the principles of domain adaptation, we aligned domains by matching them within the latent space. Intuitively, a shared feature space exists between these domains, aside from the unmodeled dynamics. Therefore, the domains can be bridged by creating a domain-invariant feature representation.

The core of our methodology is a long short-term memory (LSTM)-based dual AE architecture. This dual AE is designed to fulfill two primary objectives: reconstruction-based domain adaptation and anomaly detection. This strategic design ensures that the training for feature extraction intrinsically supports collision detection. We opted for a dual AE over a single AE because the significant gap between the domains necessitates separate encoders, and accurate reconstruction of each domain is crucial for precise anomaly detection. Building on this foundation, we utilized the extracted features to further train the kinematic estimation model.

The reconstruction task is integral to the AE’s functionality, facilitating the extraction of meaningful latent feature representations. For domain adaptation, the reconstruction task ensures that the latent space captures essential information from the sensor data, thereby enabling effective domain alignment. Conversely, in the context of anomaly detection, the latent features should perform poorly in unseen conditions to detect anomalies effectively. To address these conflicting requirements, we input both the latent features and pressure data into the decoder. This approach allows the decoder to focus more on its role in anomaly detection, ensuring that it does not reconstruct well in unseen conditions. To ensure that the latent features capture meaningful information from the sensors, we train them alongside the shape estimation model. This joint training ensures that the latent features are useful for both shape estimation and collision detection tasks.

The comprehensive structure of our network, including the dual AE, a kinematic estimation model, and a collision detection mechanism, is shown in Figure 5a. In our method, the dual AE is used to extract the shared latent space from each of the two sensor domains $X_{Real}, X_{Sim} \in ℝ^{5}$ , where the dimension corresponds to the number of air channels in the robot. These are alternately trained in a ratio of 5:1 for two phases, as described below.

FIG. 5.

Our proposed network of domain adaptation with multiple tasks by using a dual cross-modal autoencoder (AE). (a) Features are extracted from sensor domains and input into the kinematics estimation model or decoder along with pressure. (b) Neural network architecture for domain-invariant latent representation learning.

In the first phase, the dual AE is trained for feature extraction. Each encoder $E_{(\cdot)}$ maps the strain sensor data $X_{(\cdot)}$ to the latent space $Z_{(\cdot)} \in ℝ^{5}$. In parallel, the decoders $D_{(\cdot)}$ are trained to reconstruct the sensor data from the extracted feature concatenated with the pressure $p \in ℝ^{5}$; conditioning on this control variable ensures a reliable reconstruction. Notably, this training uses only data from the scenario without an obstruction (hereinafter, unobstructed data), so that reconstruction errors are large when the model is applied to data from the scenario with an obstruction (hereinafter, obstructed data), thereby facilitating effective collision detection. As shown in Figure 5b, the loss function in this domain adaptation phase, $L_{DA}$, can be written as follows:

$L_{DA} = L_{Recon_{R}} + L_{Recon_{S}} + L_{Diff}$ (2)

where $L_{Recon_{(\cdot)}}$ are the reconstruction losses, and $L_{Diff}$ is the difference loss that estimates the error between the latent variables from the simulated and real domains. These losses can be expressed as follows:

$L_{Recon_{(\cdot)}} = \| X_{(\cdot)} - D_{(\cdot)} (E_{(\cdot)} (X_{(\cdot)}) ⨁ p) \|_{2}$ (3)

$L_{Diff} = \| E_{Sim} (X_{Sim}) - E_{Real} (X_{Real}) \|_{2}$ (4)

where ⨁ denotes the concatenation of the tensors.
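Equations (2) to (4) can be evaluated for a single time step with the following minimal sketch. It replaces the LSTM encoders and decoders with plain Python callables, uses lists in place of tensors, and assumes that the same pressure vector conditions both decoders; all names and toy modules are hypothetical and do not reproduce the trained networks.

```python
import math


def l2_norm(v):
    """Euclidean norm of a sequence of numbers."""
    return math.sqrt(sum(x * x for x in v))


def recon_loss(x, encoder, decoder, pressure):
    """Equation (3): || x - D(E(x) (+) p) ||_2, with (+) as list concatenation."""
    z = encoder(x)
    x_hat = decoder(z + pressure)
    return l2_norm([a - b for a, b in zip(x, x_hat)])


def diff_loss(x_sim, x_real, enc_sim, enc_real):
    """Equation (4): || E_Sim(x_sim) - E_Real(x_real) ||_2."""
    return l2_norm([a - b for a, b in zip(enc_sim(x_sim), enc_real(x_real))])


def domain_adaptation_loss(x_real, x_sim, pressure, enc_real, dec_real, enc_sim, dec_sim):
    """Equation (2): sum of both reconstruction losses and the difference loss."""
    return (
        recon_loss(x_real, enc_real, dec_real, pressure)
        + recon_loss(x_sim, enc_sim, dec_sim, pressure)
        + diff_loss(x_sim, x_real, enc_sim, enc_real)
    )


# Toy stand-ins for the networks (5 sensor channels, 5 latent values, 5 pressures).
enc_real = lambda x: [0.5 * v for v in x]
dec_real = lambda zp: [2.0 * v for v in zp[:5]]   # ignores the pressure part
enc_sim = lambda x: list(x)
dec_sim = lambda zp: list(zp[:5])                 # ignores the pressure part

p = [10.0, 0.0, 0.0, 20.0, 0.0]
aligned = domain_adaptation_loss([2.0] * 5, [1.0] * 5, p, enc_real, dec_real, enc_sim, dec_sim)
misaligned = domain_adaptation_loss(
    [2.0] * 5, [1.0, 1.0, 1.0, 1.0, 2.0], p, enc_real, dec_real, enc_sim, dec_sim
)
```

In the toy case, both reconstructions are exact, so the total loss equals the latent mismatch: zero when the two encoders map their inputs to the same latent vector and one when a single latent component differs by one.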

In the second phase, task-specific calibration was performed in the shared latent space. Of our two tasks, kinematic estimation was learned through a neural network, referred to as the kinematic estimation model, which predicts the outermost shape of the robot $k \in ℝ^{369}$. In line with this task learning, the encoder of the simulated domain $E_{Sim}$ was simultaneously trained to yield a latent space more favorable to the tasks. As in the first phase, the tuned features from the simulated sensor data were concatenated with the pressure and served as an input to the kinematic estimation model. However, unlike the first phase, the training in the second phase included behaviors in both obstructed and unobstructed environments. The resulting training minimized the kinematic estimation loss $L_{kine}$, which is defined as follows:

$L_{kine} = \| k_{t} - K (E_{Sim} (X_{Sim}) ⨁ p) \|$ (5)

where $k_{t}$ is the ground truth value, and K denotes the kinematic estimation model. Algorithm 1 summarizes the overall training procedure.
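The dimension of the shape vector follows from the 123 nodes with three coordinates each (123 × 3 = 369). The sketch below flattens a node list into such a vector and evaluates Equation (5) for one time step; the norm in Equation (5) carries no subscript, so an L2 norm is assumed here, and the function names and toy shapes are hypothetical.

```python
import math


def flatten_nodes(nodes):
    """Turn a list of (x, y, z) node positions into one flat shape vector."""
    return [coordinate for node in nodes for coordinate in node]


def kinematic_loss(k_true, k_pred):
    """Equation (5) with an assumed L2 norm: || k_t - k_pred ||."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(k_true, k_pred)))


# Toy shapes: the prediction matches the ground truth except for one node.
true_nodes = [(0.0, 0.0, 0.0)] * 123
pred_nodes = [(3.0, 4.0, 0.0)] + [(0.0, 0.0, 0.0)] * 122

k_t = flatten_nodes(true_nodes)
k_hat = flatten_nodes(pred_nodes)
loss = kinematic_loss(k_t, k_hat)
```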

We used various baseline methods to validate the beneficial features in our framework and the performance of domain adaptation and anomaly detection. For domain adaptation, supervised learning based on the LSTM (vanilla LSTM), convolutional deep domain adaptation model for time-series data (CoDATS),²⁹ recurrent DANN (R-DANN), domain separation network (DSN),³⁰ and weight-shared AE (single AE) approaches are utilized. In the case of anomaly detection, we adopt prediction-based³¹ (prediction-based real-to-sim vanilla LSTM, prediction-based dual AE) and reconstruction-based methods (reconstruction-based DSN³²). For more details on each baseline method and the network structures, see Supplementary Section S4.

Training details

For the training process, we utilized five observations of random actions in both the simulated and real worlds and ten observations of obstructed crawling in the simulated setting. From these observations, one sample of each observation set was used for validation, while the others were used for training. To ensure stable and efficient training, all training data were normalized. The models were trained with the Adam optimizer³³ with an initial learning rate of $4 \times 10^{- 4}$ for the domain adaptation model and $10^{- 3}$ for the shape estimation model. Training continued until the validation loss stopped improving, that is, until it had increased monotonically for 100 epochs. A weight decay coefficient of $10^{- 6}$ was applied to all the models to prevent overfitting and encourage generalization. For a fair comparison, we conducted training five times with different random seeds.
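One possible reading of the stopping criterion is sketched below: training stops once the validation loss has risen for a given number of consecutive epochs. This interpretation, the class name, and the short patience used in the example are assumptions; the text reports a window of 100 epochs.

```python
class MonotonicIncreaseStopper:
    """Signal a stop after `patience` consecutive epochs of rising validation loss."""

    def __init__(self, patience=100):
        self.patience = patience
        self.previous = None
        self.rising_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if self.previous is not None and val_loss > self.previous:
            self.rising_epochs += 1
        else:
            self.rising_epochs = 0
        self.previous = val_loss
        return self.rising_epochs >= self.patience


# Short patience for illustration only.
stopper = MonotonicIncreaseStopper(patience=3)
decisions = [stopper.step(v) for v in [1.0, 0.9, 0.95, 0.96, 0.97]]

noisy = MonotonicIncreaseStopper(patience=3)
noisy_decisions = [noisy.step(v) for v in [1.0, 1.1, 1.0, 1.1, 1.2]]
```

A more common alternative is patience-based early stopping relative to the best validation loss so far; the sketch above follows the wording of the text more literally.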

Implementation of collision detection

We conducted ground truth collision labeling based on the contact force, classifying events as collisions only in instances of strong impacts. This approach is grounded in the observation that the deformation of the robot becomes abnormal (i.e., less than usual) exclusively during strong collisions. In line with these characteristics, we used two approaches for error calculation based on the model type. For reconstruction-based methods, we calculated errors only when the reconstructed sensor data exceeded the input sensor data. In contrast, for prediction-based methods, we used the absolute value of the error to identify collisions.

To establish the threshold for collision detection, we first summed the errors observed across all five sensor channels into a single time series sequence. We then calculated the average of the 10th highest error values from each type of motion (namely, random action and obstructed crawling). The final threshold was determined by obtaining the mean value from five independently trained models, thereby ensuring a more generalized result.

Given that pressure was applied at intervals of 0.5 s, we segmented the data into sets of five time steps. Collision labeling was then implemented based on the number of data points that exceeded the predetermined threshold within each of these segments. In the simulation environment, we set this threshold number to 1, and we raised it to 2 in the real-world experimental setup to account for potential noise.
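The detection steps described in this section can be sketched as follows. The sketch makes several assumptions that are not stated in the text: a segment is labeled as a collision when the number of exceeding points is at least the threshold number, the comparison with the error threshold is strict, and the threshold is the mean of the 10th-highest error of each motion type for a single model. All function names and toy values are hypothetical.

```python
def one_sided_error(x_in, x_rec):
    """Reconstruction-based error summed over channels, counted only where
    the reconstructed value exceeds the input value."""
    return sum(max(r - x, 0.0) for x, r in zip(x_in, x_rec))


def absolute_error(x_in, x_pred):
    """Prediction-based error summed over channels, using absolute values."""
    return sum(abs(p - x) for x, p in zip(x_in, x_pred))


def nth_highest(values, n=10):
    """Return the n-th highest value of a sequence (n=1 is the maximum)."""
    return sorted(values, reverse=True)[n - 1]


def threshold_from_motions(error_series_by_motion, n=10):
    """Average the n-th highest error over the motion types."""
    picks = [nth_highest(series, n) for series in error_series_by_motion]
    return sum(picks) / len(picks)


def label_segments(errors, threshold, segment_len=5, min_count=1):
    """Label each non-overlapping segment as a collision when at least
    `min_count` of its points exceed `threshold`."""
    labels = []
    for start in range(0, len(errors) - segment_len + 1, segment_len):
        segment = errors[start:start + segment_len]
        exceeding = sum(1 for e in segment if e > threshold)
        labels.append(exceeding >= min_count)
    return labels


# Toy error sequence covering three segments of five time steps.
errors = [0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 2, 0, 0]
sim_labels = label_segments(errors, threshold=1.0, min_count=1)
real_labels = label_segments(errors, threshold=1.0, min_count=2)
```

With a threshold number of 2, the isolated spike in the second segment is no longer labeled as a collision, which illustrates how the stricter setting suppresses noise-induced detections.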

Results

We first assessed the adaptability across sensor domains through kinematic estimation, providing a comparative analysis with baseline methods. Next, we demonstrated the effectiveness of our approach in collision detection through anomaly detection experiments. To ensure the reliability and applicability of our results, we averaged them across five distinct trained models. The performance was evaluated by comparison with the aforementioned baseline methods and with a specific variant of our method—a dual AE without pressure for the decoder, denoted as dual AE (w/out P). This comparison allowed us to investigate the effects of integrating pressure data into the decoder. We concluded our analysis by explicitly illustrating the reduction of the domain gap by comparing the latent vectors extracted from each sensor domain.

Shape estimation results

We evaluated the performance of domain adaptation in each domain for two scenarios: random action and obstructed crawling. Table 2 summarizes the results of shape estimation based on physical and simulated sensors. To assess the accuracy of shape estimation, we computed the mean absolute error for the 3D position of 123 nodes over 3000 time steps.

Table 2.

Shape Estimation Results of the Simulated and Physical Sensors

Domain	Task	Baseline methods						Proposed method
Domain	Task	Sim-only vanilla LSTM	Real-to-sim vanilla LSTM	CoDATS	R-DANN	DSN	Single AE	Dual AE (w/out P)	Dual AE
Simulation	Random action	624 ± 43	—	978 ± 172	778 ± 190	745 ± 65	710 ± 65	747 ± 185	780 ± 34
Simulation	Obstructed crawling	230 ± 15	—	1141 ± 129	1227 ± 153	796 ± 25	289 ± 42	295 ± 88	324 ± 59
Real	Random action	2390 ± 183	631 ± 41	1315 ± 454	843 ± 163	745 ± 65	713 ± 63	774 ± 176	782 ± 36

AE, autoencoder; CoDATS, convolutional deep domain adaptation model for time-series data; DSN, domain separation network; R-DANN, recurrent DANN.

Simulation domain

In the simulation domain, all adaptation methods successfully estimated shapes during random actions. However, for obstructed crawling, the adversarially trained models (CoDATS and R-DANN) and DSN had larger errors. This result implies that the features extracted by these models fail to represent sensor dynamics. Figure 6a supports these performance comparisons in detail.

FIG. 6.

Comparison of kinematics estimation (a) in the simulation domain during obstructed crawling and (b) in the real domain during random actions.

Real domain

While all methods effectively estimated shapes for random actions using simulated sensor data, differences emerged when using physical sensor data. From Figure 6b and Table 2, we can see that vanilla LSTM struggles to accurately estimate kinematics without real-to-sim adaptation, highlighting the necessity for domain adaptation. Furthermore, the performances of adversarially trained models are limited, and R-DANN outperforms CoDATS in both domains. This result indicates that, compared to CNN layers, LSTM layers are better suited for soft robots that have high hysteresis and require long-term memory. Interestingly, DSN produces consistent errors in both simulated and real domains, implying that its kinematic model relies primarily on pressure data, as also shown in Figure 6a. The errors in our methods closely align with those observed in the simulation domain and the real-to-sim vanilla LSTM, confirming their adaptability.

In the case of obstructed crawling, generating equivalent simulated shape labels for both simulation and real-world scenarios was not feasible. Instead, we used marker positions obtained via motion tracking to calculate the error between the estimated and physical shapes. Specifically, we selected five markers located in the middle of each chamber, where the largest deformations occurred. The height difference between these markers and their corresponding nodes in the finite element model was then computed. The results in Table 3 indicate that the error trends obtained using physical shape labels closely mirror those from simulated shape labels in random action scenarios. However, the errors increase in obstructed crawling scenarios. This result implies that although shape estimation is feasible, model accuracy is lower in real-world obstructed crawling scenarios than in the simulations, as these conditions are unseen by the kinematics model. Notably, the dual AE variants consistently showed the lowest errors, while the adversarial methods performed less effectively, similar to their performance under simulated conditions.

Table 3.

Error Between the Shapes Estimated from the Real Sensor and the Marker Positions of the Physical Robot [mm]

Task	Baseline methods						Proposed method
Task	Sim-only vanilla LSTM	Real-to-sim vanilla LSTM	CoDATS	R-DANN	DSN	Single AE	Dual AE (w/out P)	Dual AE
Random action	2.12 ± 0.082	1.69 ± 0.020	1.81 ± 0.182	1.79 ± 0.061	1.70 ± 0.014	1.70 ± 0.004	1.65 ± 0.015	1.68 ± 0.001
Obstructed crawling	2.14 ± 0.193	1.91 ± 0.123	1.99 ± 0.067	2.03 ± 0.098	1.92 ± 0.004	1.92 ± 0.011	1.81 ± 0.009	1.87 ± 0.049

Figure 7 offers a visual comparison of the estimated shapes with their errors for obstructed crawling in the simulation domain and random action in the real domain. Well-performing baseline methods, that is, DSN and real-to-sim vanilla LSTM, are shown along with our proposed model for a comprehensive comparison. A detailed comparison of the marker data and estimated shape in the physical robot across all methods is available in Supplementary Video S1.

FIG. 7.

Comparison of the estimated kinematics results with the error for each adaptation method. The upper rows show the results from the simulation domain for obstructed crawling, whereas the lower rows show the results from the real domain for random action.

Collision detection results

Figure 8 displays the sensor data and its errors along with the detection results of our method. The figure shows that an obstruction reduces the input sensor signal. Figure 9a presents the confusion matrix used for evaluating the performance of our method. We recorded true positives (TPs) when anomalies were detected either in the current step or in the following step, to reflect the inherent time delays in soft robot reactions. For a more detailed analysis, we separately recorded TPs for the current and next steps. Anomaly detection results for obstructed crawling in both real and simulated domains are visualized in Figure 9b and c. Based on these criteria, we evaluated the results using the F1 score F1 and accuracy A, which are defined as follows:

$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot P \cdot R}{P + R}$ (6)

$A = \frac{TP + TN}{TP + TN + FP + FN}$ (7)

where TP indicates $TP_{next} + TP_{current}$, and P and R denote precision and recall, respectively. TN, FP, and FN refer to true negative, false positive, and false negative, respectively.
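Equations (6) and (7) can be computed with the short sketch below. The function names and the confusion-matrix counts in the example are hypothetical; precision is returned as `None` when no positive is predicted, which corresponds to an undefined value.

```python
def f1_from(precision, recall):
    """Equation (6): harmonic mean of precision and recall."""
    if precision is None or recall is None or precision + recall == 0:
        return None
    return 2 * precision * recall / (precision + recall)


def detection_metrics(tp_current, tp_next, fp, fn, tn):
    """Precision, recall, F1 (Equation 6), and accuracy (Equation 7).

    True positives are the sum of detections in the current step and
    in the following step.
    """
    tp = tp_current + tp_next
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else None
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
        "accuracy": accuracy,
    }


# Hypothetical counts, for illustration only.
example = detection_metrics(tp_current=3, tp_next=1, fp=1, fn=4, tn=11)
no_detection = detection_metrics(tp_current=0, tp_next=0, fp=0, fn=5, tn=15)
```

When a method never raises a detection, the recall is zero and the precision and F1 score are undefined, whereas the accuracy can remain high on imbalanced data.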

FIG. 8.

Collision detection results of our proposed model (dual AE), comparing the changes in the input and reconstructed sensor signals over time, together with the robot configuration at the marked time points in (a) the simulation and (b) the real world. The data are shown after normalization.

FIG. 9.

Collision detection results. (a) Confusion matrix used in our work. TP, FN, FP, and TN indicate true positive, false negative, false positive, and true negative, respectively. Collision detection results of obstructed crawling in the (b) simulation and (c) the real world.

A higher accuracy and F1 score indicate correct data point classification. While accuracy provides an overall measure of detection correctness, the F1 score is more appropriate for imbalanced datasets. In general, the F1 score is commonly used as an evaluation metric of anomaly detection, but in random action scenarios where TP values are not available, only accuracy is used to evaluate performance. For obstructed crawling scenarios, we calculated both metrics to provide a comprehensive evaluation of the model’s performance. The results are summarized in Table 4.

Table 4.

Accuracy and F1 Scores of the Collision Detection

Domain	Task	Metric	Baseline methods			Proposed method
			Real-to-sim vanilla LSTM	DSN	Single AE	Dual AE (w/out P)	Dual AE	Dual AE
			Prediction	Reconstruction	Reconstruction	Reconstruction	Reconstruction	Prediction
Simulation	Random action	A	1	1	1	1	1	0.998
	Obstructed crawling	P	—	1	0.894	0.997	1	1
		R	0	0.202	0.584	0.872	1	0.4
		F1	—	0.336	0.706	0.930	1	0.571
		A	0.665	0.734	0.838	0.956	1	0.799
Real	Random action	A	1	0.999	1	0.990	1	0.952
	Obstructed crawling	P	—	0.279	0.169	0.142	0.259	0.134
		R	0	0.629	0.247	0.210	0.423	0.346
		F1	—	0.387	0.201	0.169	0.321	0.193
		A	0.889	0.784	0.787	0.819	0.807	0.899
Mean		P	—	0.640	0.532	0.570	0.630	0.567
		R	0	0.416	0.416	0.541	0.712	0.373
		F1	—	0.504	0.466	0.555	0.668	0.450
		A	0.889	0.879	0.906	0.941	0.952	0.912

Bolded values represent the highest value for each evaluation metric.

Simulation domain

In random action scenarios, all methods except for the prediction-based ones performed well. However, in obstructed crawling, only the reconstruction-based dual AE achieved an accuracy and F1 score of 1, even though single AE and dual AE (w/out P) also showed good results in shape estimation. Detailed results are shown in Figure 9b, where FNs are prevalent in the other methods. This is because the decoders in the other methods can reconstruct even abnormal sensor data closely, resulting in small reconstruction errors.

Real domain

While prediction-based methods achieve the highest accuracy, Table 4 shows their limitations in collision detection, which lead to low F1 scores. Notably, DSN and dual AE stand out with the highest F1 scores. However, all methods, including these two, exhibit high FP rates in real-world scenarios, which diverges from their performance in simulation and lowers overall accuracy. This deviation can be attributed to the inherent dynamics of the physical robot, such as less-pronounced initial deformations and delayed sensor responses to obstructions. These factors contribute to large reconstruction errors, which in turn lead to events being mislabeled as collisions. In contrast to its simulation performance, dual AE shows elevated FN values because of the denoising process applied to real sensor data during latent matching. As a result, smaller sensor variations are often smoothed out in the reconstructed data and remain undetected, increasing the FN values.

Across all scenarios in both domains, dual AE shows the highest mean accuracy and F1 score. These results demonstrate the effectiveness of the reconstruction-based dual AE for detecting collisions in both simulated and real sensor domains compared to the other anomaly detection methods. Moreover, although dual AE (w/out P) showed slightly better results (by 2.57%) in shape estimation, it lagged significantly in collision detection, with a performance disparity of 10.76%. This difference highlights the advantages of integrating pressure into the decoder when performing domain adaptation and collision detection tasks.

Generalization of the model

To demonstrate the generalization of our method, we conducted additional experiments across different scenarios with varying actuation time steps and environmental setups. For detailed results, see Supplementary Section S6.

Model analysis

To assess how much each method reduces the gap between domains, we used t-distributed stochastic neighbor embedding (t-SNE)³⁵ to compare the features extracted from each domain. Projecting the features from a high-dimensional space onto a 2D plane with t-SNE allows us to visualize and compare the characteristics of the two sets of feature vectors. In this 2D plane, the distance between any two points indicates the similarity of the features that they represent.
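For reference, the standard t-SNE formulation from the cited work can be summarized as follows. The symbols are introduced here only for illustration and are not used elsewhere in this article: $x_i$ denotes a high-dimensional feature vector, $y_i$ its 2D projection, $N$ the number of points, and $\sigma_i$ a per-point bandwidth set by the perplexity.

```latex
\begin{align}
p_{j|i} &= \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}, \\
q_{ij} &= \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}, \\
C &= \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}.
\end{align}
```

Because the cost $C$ mainly penalizes placing similar points far apart, small distances in the 2D plane are a more reliable indicator of feature similarity than large ones.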

As shown in Figure 10, the feature clusters from the simulated and real domains before domain adaptation, which represent the raw sensor data, occupy a similar region but are separated by a noticeable gap. Additionally, the flow of features within the feature vector appears disorganized, which is likely a consequence of the nonlinear attributes of the soft sensors. In contrast, after domain adaptation, the extracted features appear linear, which is characteristic of time-series data, suggesting that the adaptation methods successfully captured the time-series property of the data. However, the DSN and R-DANN methods yield more distinct feature clusters for the two domains. In comparison, our dual AE model shows regions where the simulated and real domains nearly overlap, indicating proficient adaptation of the sensors in the two domains. Moreover, the distribution exhibits a more pronounced differentiation than before, underscoring the capability of our model to not only bridge the domain gap but also refine the sensor data for enhanced distinction.

FIG. 10.

Data distributions of sensor data and extracted features from the simulated and real domains.

Conclusion

In this work, we introduced an unsupervised domain adaptation methodology for sim-to-real bridging of soft robot perception using a dual cross-modal AE. The sensor dynamics in these heterogeneous domains are matched at the latent level, eliminating the domain-specific differences between them. Through extensive investigations, we demonstrated the effectiveness of our method compared with previously developed methods in multiple tasks that are crucial and challenging in autonomous soft robot operation. Our results show that our framework not only performs comparably to supervised learning in domain adaptation but even outperforms it, especially under unseen real-world conditions such as obstructed crawling. This result emphasizes the robustness and generalizability of our latent matching approach.

As mentioned in the “Introduction” section, machine learning provides a means to address the complexity of the modeling of soft robots. Although the approach is of interest in the field of generic robotics, the unique characteristics of soft robots strongly require such data-driven computation, rather than analytical and numerical formulations. For instance, under ideal conditions, highly accurate soft body simulation can achieve a computationally efficient calibration process without any domain adaptation process. However, there are challenges posed by variance in the manufacturing process and high complexity (or, often, unavailability) in soft continuum mechanics that exhibit nonlinear and contact-rich characteristics. Our simulation achieves adequate computation performance, indicating that our approach is generalizable to various soft robot designs that involve comprehensive actuation mechanisms, geometry, and perception methods and is applicable to many other perception tasks such as terrain classification and environmental recordings.

Although we demonstrated a methodology for sim-to-real transfer of sensors via the proprioceptive multigait soft robot, some limitations remain. First, during the latent matching process, our framework denoises real-world sensor data. Therefore, abnormal sensor changes are reflected less strongly in the reconstructed data, leading to less distinguishable reconstruction errors than those observed in simulation. To alleviate this problem, we summed the errors across all five channels for collision detection, albeit at the cost of low sensing resolution and sensitivity. Future research on reducing sensor noise during the fabrication or measurement stages could improve collision detection accuracy and make segment-specific detection possible. In addition, our experimental setup primarily focused on a crawling gait pattern with an obstruction, namely a wall. This framework can be extended to various control tasks and obstructions. By incorporating higher-resolution sensors, such as those for the whole-body sensing approach,³⁶ we can access a richer dataset. An enriched data pool can facilitate more precise classification and recognition for various perceptual tasks, including identifying the point of contact of the robot with obstacles or classifying different types of terrains and obstructions. Last, the robot used in our study has a simple structure with five pneunets bending in specific directions. This simplicity allowed us to train our model using a relatively small real-world dataset (five observations of 5 min each). However, for more complex robotic structures, a larger real-world dataset would be necessary for effective model training. In such cases, incorporating the generative modeling approach³⁷ for data augmentation could significantly enhance the efficiency of data acquisition.

The resulting digital-twinned perception can serve as a substantial basis for learning high-level soft robot control, for example through reinforcement learning. Training models in simulation across diverse configurations can enable the development of a versatile pipeline that reduces the need for extensive experimentation in the real world. Our framework can facilitate computationally efficient sim-to-real transfer of the learned control strategy.

Footnotes

Acknowledgment

This article has been previously published as a preprint on arXiv with the DOI 10.48550/arXiv.2310.14075.

Author Disclosure Statement

This work was supported by the Korea Evaluation Institute of Industrial Technology (KEIT) grant funded by the Korean government (MOTIE) (No. 141518481).

Funding Information

This work was supported by the Korea Evaluation Institute of Industrial Technology (KEIT) grant funded by the Korean government (MOTIE) (No. 1415184816).

Supplementary Material

References

1. Justus, Hellebrekers, Lewis, et al. A biosensing soft robot: Autonomous parsing of chemical signals through integrated organic and inorganic interfaces. Sci Robot, 2019; 4(31):eaax0765.

2. Wang, Totaro, Beccai. Toward perceptive soft robots: Progress and challenges. Adv Sci (Weinh), 2018; 5(9):1800541.

3. Shih, Shah, Li, et al. Electronic skins and machine learning for intelligent soft robots. Sci Robot, 2020; 5(41):eaaz9239.

4. Rus, Tolley. Design, fabrication and control of soft robots. Nature, 2015; 521(7553):467–475.

5. Rothemund, Kim, Heisser, et al. Shaping the future of robotics through materials innovation. Nat Mater, 2021; 20(12):1582–1587.

6. Park, Majidi, Kramer, et al. Hyperelastic pressure sensing with a liquid-embedded elastomer. J Micromech Microeng, 2010; 20(12):125029.

7. Yamada, Hayamizu, Yamamoto, et al. A stretchable carbon nanotube strain sensor for human-motion detection. Nat Nanotechnol, 2011; 6(5):296–301.

8. Ozel, Keskin, Khea, et al. A precise embedded curvature sensor module for soft-bodied robots. Sensors Actuators A Phys, 2015; 236:349–356.

9. Thuruthel, Shih, Laschi, et al. Soft robot perception using embedded soft sensors and recurrent neural networks. Sci Robot, 2019; 4(26):eaav1488.

10. Kim, Kim, et al. Review of machine learning methods in soft robotics. PLoS One, 2021; 16(2):e0246102.

11. Chin, Hellebrekers, Majidi. Machine learning for soft robotic sensing and control. Adv Intelligent Syst, 2020; 2(6):1900171.

12. Truby, Chin, Zhang, et al. Fluidic innervation sensorizes structures from a single build material. Sci Adv, 2022; 8(32):eabq4385.

13. Truby, Della Santina, Rus. Distributed proprioception of 3d configuration in soft, sensorized robots via deep learning. IEEE Robot Autom Lett, 2020; 5(2):3299–3306.

14. Loo, Ding, Baskaran, et al. Robust multimodal indirect sensing for soft robots via neural network-aided filter-based estimation. Soft Robot, 2022; 9(3):591–612.

15. Van Meerbeek, De Sa, Shepherd. Soft optoelectronic sensory foams with proprioception. Sci Robot, 2018; 3(24):eaau2489.

16. Han, Kim, et al. Use of deep learning for characterization of microfluidic soft sensors. IEEE Robot Autom Lett, 2018; 3(2):873–880.

17. Shih, Drotman, Christianson, et al. Custom soft robotic gripper sensor skins for haptic object visualization. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017; 494–501; doi: 10.1109/IROS.2017.8202199

18. Thuruthel, Falotico, Renda, et al. Model-based reinforcement learning for closed-loop dynamic control of soft robotic manipulators. IEEE Trans Robot, 2019; 35(1):124–134.

19. Graule, McCarthy, Teeple, et al. Somogym: A toolkit for developing and evaluating controllers and reinforcement learning algorithms for soft robots. IEEE Robot Autom Lett, 2022; 7(2):4071–4078.

20. Graule, Teeple, McCarthy, et al. Somo: Fast and accurate simulations of continuum robots in complex environments. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021; 3934–3941; doi: 10.1109/IROS51168.2021.9636059

21. , Hughes, Wah, et al. Underwater soft robot modeling and control with differentiable simulation. IEEE Robot Autom Lett, 2021; 6(3):4994–5001.

22. Schaff, Sedal, Walter. Soft robots learn to crawl: Jointly optimizing design and control with sim-to-real transfer. arXiv Preprint arXiv, 2022:2202.04575.

23. Yoo, Zhao, Altamirano, et al. Toward zero-shot sim-to-real transfer learning for pneumatic soft robot 3d proprioceptive sensing. arXiv Preprint arXiv, 2023:2303.04307.

24. Gao, Michelis, Spielberg, et al. Sim-to-Real of Soft Robots with Learned Residual Physics. arXiv Preprint arXiv, 2024:2402.01086.

25. Shepherd, Ilievski, Choi, et al. Multigait soft robot. Proc Natl Acad Sci U S A, 2011; 108(51):20400–20403.

26. Dickey. Stretchable and soft electronics using liquid metals. Advanced Materials, 2017; 29(27):1606425.

27. Goury, Duriez. Fast, generic, and reliable control and simulation of soft robots using model order reduction. IEEE Trans Robot, 2018; 34(6):1565–1576.

28. Tapia, Knoop, Mutný, et al. Makesense: Automated sensor design for proprioceptive soft robots. Soft Robot, 2020; 7(3):332–345.

29. Wilson, Doppa, Cook. Multi-source deep domain adaptation with weak supervision for time-series sensor data. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2020; 1768–1778.

30. Bousmalis, Trigeorgis, Silberman, et al. Domain separation networks. Advances in Neural Information Processing Systems, 2016:29.

31. Malhotra, Vig, Shroff, et al. Long short term memory networks for anomaly detection in time series. ESANN, 2015; 2015:89.

32. Yang, Soltani, Darve. Anomaly detection with domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVF, 2023, pp. 2957–2966.

33. Kingma. Adam: A method for stochastic optimization. arXiv Preprint arXiv, 2014:1412.6980.

34. , Wu, Ma, et al. Diffpd: Differentiable projective dynamics. ACM Trans Graph, 2022; 41(2):1–21.

35. Van der Maaten, Hinton. Visualizing data using t-SNE. J Machine Learning Res, 2008; 9(11).

36. Park, Park, Mo, et al. Deep neural network based electrical impedance tomographic sensing methodology for large-area robotic tactile sensing. IEEE Trans Robot, 2021; 37(5):1570–1583.

37. Sapai, Loo, Ding, et al. A deep learning framework for soft robots with synthetic data. Soft Robot, 2023; 10(6):1224–1240.
