Abstract
Despite the development of numerous soft grippers designed to handle deformable objects, hardness sensing remains a challenge, yet it is essential for applications such as product selection or sorting, fruit ripeness assessment, and food quality control. This research introduces GripDepthSense3DNet, an approach that integrates 3D depth sensing with machine learning for accurate hardness sensing during grasping. Using a dataset of depth images of diverse objects undergoing deformation, the proposed network is trained to capture spatial–temporal deformation features from a series of depth images. GripDepthSense3DNet outperforms state-of-the-art networks, achieving a mean absolute percentage error of 0.46% for trained shapes and hardness levels. Compared to ResNet-50, the model has approximately 94.8% fewer parameters and a training time that is around 92.9% shorter on equivalent hardware. Different depth ranges and intervals were studied to arrive at an optimal configuration. Through dynamic tuning, the network can incorporate new shapes, new hardness levels, and even intricate arbitrary objects, which highlights the adaptability of the approach.
Introduction
Soft robotic systems are increasingly being deployed in real-world applications that demand adaptability in interacting with the environment. Achieving such capabilities requires the integration of advanced perception and manipulation techniques.
Recently, there has been a significant development of soft grippers designed to handle delicate and deformable objects. The ability to grasp and manipulate such objects is crucial, as many real-world scenarios involve items with varying degrees of deformability. Effective grasping approaches for soft grippers are commonly categorized into gripping by actuation, controlled stiffness, or controlled adhesion. Gripping by actuation can be achieved through contact-driven technology, 1 fluidic-elastomer-actuators,2–7 or electroactive-polymers. 8 Grippers employing controlled stiffness9,10 are initially put in a soft configuration before the stiffening mechanism is activated to achieve object grasping. As for controlled adhesion, the two major adhesion technologies are electro-adhesion and gecko-adhesion. 11
Sensing plays a crucial role in the effectiveness of soft grippers, as it provides the basis of informed decision-making. This can be done through pure tactile sensing,12–14 visual sensing using an encased 2D or 3D camera,15–24 or force-based sensing.25–27 Multimodal sensing is possible with the combination of two or more sensing techniques. 28 Visual sensing may yield a greater spatial resolution, but achieving high sensitivity to tactile interactions comparable to traditional tactile sensing poses a challenge. Besides, tracking markers may be required,18–24 specifically when depth information is not readily available.
Object rigidity is typically assumed when object deformability poses no impediment to the task. If it becomes a significant consideration, the integration of robust capabilities for sensing deformable objects becomes imperative. Within the realm of deformation sensing, hardness/stiffness sensing18,19,29–32 finds practical applications in product selection or sorting, assessing fruit ripeness,33,34 and ensuring material or food quality control.
Machine learning approaches can be applied to sensing applications such as object recognition,14,35 deformation tracking,36,37 and hardness/stiffness sensing.19,26 Effective sensing may be achieved via a convolutional neural network (CNN) such as the popular AlexNet, 22 ResNet series, 16 VGG16, 19 or even a custom one. 14
Table 1 presents a collection of sensing-related studies15,16,19–22,25,32,38–41 employing a soft contact material. 3D CNN architectures have been employed for tactile object recognition,38,39 whereas time-series data has shown significant utility in two-class stiffness classification. 40 However, with the exception of two studies,19,32 the rest do not focus on single-label hardness estimation. Notably, sensors such as GelSight 19 require markers on the membrane. For the soft durometer, 32 compressive deflection of the embedded magnetic probe is measured. Among the vision-based studies, only two have leveraged a 3D camera,15,16 which eliminates the need for markers on the membrane.
Table 1. Summary of Literature Survey on Selected Object-Sensing Studies Utilizing Soft Contact Materials, Which Undergo Deformation upon Contact with Objects, Thereby Revealing Valuable Insights into the Object’s Properties
The GelSight sensor in Yuan et al. 19 has a function similar to the proposed work, but object grasping is not shown, markers on the membrane are inevitable, and the network complexity is evident. A 3D time-of-flight camera cannot be used in Sakuma et al. 20 due to the presence of filling. In Hughes et al., 25 limited spatial resolution is observed, and sensing ability is limited to hard objects. In comparison to utilizing a soft durometer 32 with a force sensor, vision-based sensing delivers comprehensive information about the entire object surface, enhancing the understanding of its hardness distribution. Conversely, the soft durometer may only provide hardness measurements at specific contact points.
In contrast, our work presents a vision-based approach for estimating the hardness of objects in a scene using a novel custom 3D CNN architecture designed to efficiently process depth images obtained from a single depth camera embedded inside a robotic gripper. Vision-based sensing decouples the sensing mechanism from the contact material, thereby eliminating the need for redesign or extensive recalibration of sensing elements across different applications and making it adaptable to various gripper designs. Compared to other sensing methodologies, it facilitates easier maintenance, as the sensing components experience minimal wear and tear. In addition, the proposed approach provides nondestructive sensory-guided robotic perception, meaning that it senses object properties without causing any permanent physical alteration, damage, or wear to the objects being examined. This is achieved by fusing the domains of soft robotics and visual sensing, facilitating informed and adaptive interactions with the environment. Furthermore, it offers higher spatial resolution than tactile sensors.38–40 If required, the captured image may serve additional purposes such as object classification.
The methodology will be detailed in the section “Materials and Methods.” Section “Experiment Results and Analysis” presents both the experimental results and an in-depth analysis of the findings. Section “Conclusion” concludes the work presented.
Materials and Methods
We present a novel hardness sensing framework designed for use during the grasping of soft and deformable objects by a soft robotic gripper featuring a deformable marker-less contact membrane (Fig. 1a). Integrated within the gripper is a depth camera, capturing the deformations of the membrane resulting from the interaction between the gripper and the grasped object. Depth images are systematically captured at various depth levels throughout the grasping sequence. Our custom neural network is designed to leverage the power of deep learning in processing spatial–temporal data, enabling a specialized approach to decipher complex deformation patterns and establish correlations with the hardness values of a diverse array of objects.

The gripper used draws inspiration from the particle jamming gripper,9,10 a widely used soft gripper known for its capability to grasp objects of various geometries. However, we modified the gripper design to exclude the granular material inside, enabling the embedded camera to accurately observe membrane deformation. Instead, we utilize flexible fingers to securely hold the object. Our hardness sensing model outputs a single hardness value (regression), rather than simply assigning a hardness class (classification). This allows for more detailed hardness sensing during object manipulation.
Gripper design
The gripper (Fig. 2) adopts a modular design for enhanced flexibility, cost-effectiveness, and easy maintenance. Modifications such as adjusting the number or length of the fingers, or changing the depth camera or the robotic manipulator, require only the affected part to be modified.

Fig. 2. Renderings of the custom soft robotic gripper design.
A thin and opaque marker-less membrane, cut from a large balloon, is affixed to the gripper’s exterior. This contact membrane, when pressed against an object, undergoes deformation. The tension on the membrane then compels the flexible fingers to close. The membrane deformations during object grasping contain valuable information on the object’s hardness.
To ensure the fingers and membrane maintain optimal elasticity, the gripper’s custom software overlays fiducials on the live depth camera feed, outlining the desired effective contact area. This allows any degradation in elasticity to be easily detected on the software interface through visible changes in the contact area.
Fabrication and collection of soft objects
For network development and evaluation, 24 soft objects with basic shapes, 16 with intricate shapes, and several everyday objects were included, as pictured in Figure 3. All objects other than the everyday ones were fabricated using SMOOTH-ON Ecoflex. The training objects cover six hardness levels, achieved by adjusting the percentage of SMOOTH-ON Slacker added. The hardness values of the training objects span from approximately 16 to 68 H00 (see Supplementary Data S1 for the ground-truth hardness values). This study is constrained to homogeneous objects, wherein the entire object is constructed from the same material. Hardness is measured using a handheld durometer compliant with the ASTM D2240 standard, with an accuracy of ±1 H00. Each measurement is the average of five readings taken at various points on the object.

Fig. 3. Fabricated soft objects, selected to ensure diversity in shape and hardness. The choice of basic shapes (sphere, cylinder, hexagon, and cube) stems from their prevalence in benchmarking scenarios, with each shape being distinct from the others. To further challenge the network’s adaptability, more complex shapes were introduced during the dynamic tuning experiment, showcasing the network’s ability to learn and generalize to novel shapes.
Fabricating intricate soft objects from silicone rubber allows for identical-looking objects with varying hardness levels. This keeps the focus solely on the deformation of the contact membrane: color patterns are irrelevant, and the primary factor of interest is the objects’ differing hardness levels.
Robotic sequencing
The robot manipulator employed is the ABB-IRB-120 model. The programmed grasping sequence is shown in Figure 4. The fingers are closed by the pneumatics subsystem, which includes a vacuum generator that provides negative pressure for secure gripping.

Fig. 4. Robotic grasping sequence with a rose-shaped soft object.
The pneumatic air supply is first regulated to a constant pressure of 4 bar (approximately 58.02 psi). This air pressure remains fixed for all objects, as the flexible fingers are designed to adaptively conform to the shape of the objects, allowing for effective grasping without the need to adjust the air pressure, thereby eliminating the need for an additional feedback loop. The air supply is then linked to a solenoid valve, operated by a digital signal from the robot, and the valve’s output connects to the vacuum generator, which induces negative air pressure for grasping. To release the object, the vacuum supply is deactivated, causing the air pressure inside the gripper to increase, which opens the fingers and releases the object.
For data collection, the robotic sequence resembles that of object grasping: the gripper descends from its initial position to the 40 mm position (in increments of 1 mm) while the pneumatic subsystem remains inactive. Due to the tension applied by the membrane upon contact with the object, the fingers are compelled to move and close inward. Following this, the robot ascends to a holding position (without the object, as grasping was not completed) and undergoes a 3° clockwise rotation. This cycle is repeated for orientations within the first quadrant, spanning from 0° to 87°. There is no need to deliberately reposition the object at various locations on the platform: when the gripper makes contact with the object, it naturally aligns the object to a centralized position within the gripper. Nevertheless, the robot’s motion may inadvertently induce minor movements in the object during each cycle, introducing a certain level of variability into the collected dataset.
Data collection encompasses all objects in Figure 3. Collecting data for all 41 depths (0–40 mm) allows for the exploration of various factors, such as employing different numbers of depth levels, adjusting the depth range, and experimenting with different depth intervals.
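The capture schedule described above can be enumerated as a short Python sketch. The function name and its parameters are illustrative assumptions rather than part of the robot program; only the step sizes and limits (1 mm steps to 40 mm, 3° steps from 0° to 87°) come from the text.

```python
def capture_schedule(max_depth_mm=40, depth_step_mm=1,
                     angle_step_deg=3, last_angle_deg=87):
    """List the (orientation, depth) pairs visited for one object.

    Hypothetical helper: for each orientation in the first quadrant the
    gripper descends through every depth level, then rotates by one step.
    """
    schedule = []
    for angle in range(0, last_angle_deg + 1, angle_step_deg):
        for depth in range(0, max_depth_mm + 1, depth_step_mm):
            schedule.append((angle, depth))
    return schedule


schedule = capture_schedule()
orientations = sorted({angle for angle, _ in schedule})
depths = sorted({depth for _, depth in schedule})
print(len(orientations), len(depths), len(schedule))
```

Under these settings there are 30 orientations and 41 depth levels, giving 1,230 depth images per object before any augmentation.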
Generation of depth images
The visual sensor used, shown at the top of Figure 1b, is the CamBoard pico flexx 3D time-of-flight (ToF) depth camera, 42 which has a spatial resolution of 224 × 172 pixels, a depth resolution of ≤2% of distance, and a measurement range of 0.1–4 m. It has been successfully used in deformation-sensing applications.15,16
The deformation of our gripper’s contact membrane during grasping falls within the camera’s measurement range. Using its ToF-based point cloud measurements, the depth information of the membrane can be captured without requiring markers. To achieve optimal representation of deformations in 8-bit grayscale depth images (with intensity values ranging from 0 to 255), our custom gripper software provides options to adjust threshold distances. This flexibility accommodates design changes in the gripper and ensures the depth resolution remains sensitive enough to capture fine details. A transformation process maps raw sensor readings into pixel intensity values according to a predefined depth range and direction, enhancing visualization accuracy (as depicted in Fig. 1b).
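A minimal sketch of such a transformation is given below, assuming a linear mapping between two threshold distances. The function names, the linear form, the clamping behaviour, and the example threshold values are all assumptions made for illustration; the text only states that threshold distances and the mapping direction are configurable.

```python
def depth_to_gray(depth_mm, near_mm, far_mm, near_is_bright=True):
    """Map one depth reading (mm) to an 8-bit intensity in [0, 255].

    Assumed behaviour: readings outside [near_mm, far_mm] are clamped,
    and the mapping inside the window is linear.
    """
    if far_mm <= near_mm:
        raise ValueError("far_mm must be greater than near_mm")
    clamped = min(max(depth_mm, near_mm), far_mm)
    t = (clamped - near_mm) / (far_mm - near_mm)  # 0 at near, 1 at far
    if near_is_bright:
        t = 1.0 - t  # direction flag: nearer surfaces appear brighter
    return int(round(255 * t))


def depth_image_to_gray(depth_rows, near_mm, far_mm, near_is_bright=True):
    """Apply depth_to_gray to every pixel of a 2D list of depth readings."""
    return [[depth_to_gray(d, near_mm, far_mm, near_is_bright) for d in row]
            for row in depth_rows]


# Hypothetical thresholds of 100 mm and 200 mm.
print(depth_image_to_gray([[100, 200], [50, 250]], near_mm=100, far_mm=200))
```

Narrowing the window between the two thresholds spreads the 256 intensity levels over a smaller depth span, which is one way the depth resolution of the image can be kept sensitive to small membrane deformations.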
After generating the depth image, a square region of interest of 150 × 150 pixels is cropped to center the focus on the object. Besides reducing the pixel count by about 42% (from 224 × 172 to 150 × 150), this step removes portions of the original image that would add needless complexity and could confuse the neural network if retained. Next, a median filter with a kernel size of 3, the smallest effective size, is applied to suppress noise from the camera. Any additional resizing needed for input to the hardness sensing network is handled in later stages.
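The cropping and filtering steps can be illustrated with the following pure-Python sketch. It assumes that the region of interest is centered in the frame and that border pixels are filtered using only the neighbours that exist; neither detail is specified in the text, and the helper names are hypothetical.

```python
from statistics import median_low


def center_crop(image, size):
    """Crop a centered size x size window from a 2D list (assumed centered ROI)."""
    rows, cols = len(image), len(image[0])
    if size > rows or size > cols:
        raise ValueError("crop size exceeds image dimensions")
    top = (rows - size) // 2
    left = (cols - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def median_filter3(image):
    """3 x 3 median filter; windows are truncated at the image border."""
    rows, cols = len(image), len(image[0])
    filtered = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [image[rr][cc]
                      for rr in range(max(r - 1, 0), min(r + 2, rows))
                      for cc in range(max(c - 1, 0), min(c + 2, cols))]
            # median_low keeps the result an existing integer intensity.
            out_row.append(median_low(window))
        filtered.append(out_row)
    return filtered


# A 224 x 172 frame (172 rows, 224 columns) with one isolated noisy pixel.
frame = [[0] * 224 for _ in range(172)]
frame[86][112] = 255
roi = median_filter3(center_crop(frame, 150))
reduction = 1 - (150 * 150) / (224 * 172)
print(len(roi), len(roi[0]), round(100 * reduction))
```

The isolated spike is removed by the filter, and the pixel count drops from 38,528 to 22,500, which is the roughly 42% reduction quoted above.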
Since the robotic motion only covers orientations in the first quadrant, images representative of the remaining quadrants are generated by digitally rotating the captured images. This approach significantly streamlines the data collection process using the robot manipulator. With more variations added, the total number of images is increased from 55,350 to 221,400. The increased diversity in the dataset allows the network to better generalize across different orientations, improving its performance when making predictions. Figure 5 shows a snippet of the dataset.
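One way to realize this augmentation is sketched below, under the assumption that the additional quadrants are produced by rotating each square image in 90° steps; the exact rotation angles and direction used for the dataset are not stated, so both are assumptions, as are the function names.

```python
def rotate90_ccw(image):
    """Rotate a 2D list-of-lists image by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]


def augment_quadrants(image, angle_deg):
    """Return {orientation: image} for the captured image and three rotated copies.

    Assumption: each 90-degree digital rotation stands in for a capture
    taken 90 degrees further round, so one first-quadrant image yields
    one image per quadrant.
    """
    augmented = {}
    current = image
    for k in range(4):
        augmented[(angle_deg + 90 * k) % 360] = current
        current = rotate90_ccw(current)
    return augmented


copies = augment_quadrants([[1, 2], [3, 4]], angle_deg=3)
print(sorted(copies))    # orientations covered by one captured image
print(55350 * 4)         # dataset size after augmentation
```

Because the cropped images are square, rotations by multiples of 90° need no interpolation or padding, and the fourfold increase matches the growth from 55,350 to 221,400 images.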

Fig. 5. Examples of depth images collected.
Although subtle differences are apparent in the images of objects with the same shape but different hardness when the depth is kept constant (Fig. 5c), relying solely on a single depth level for hardness estimation is inadequate. A single image cannot capture the spatial–temporal features arising from the progressive changes in membrane deformation during the interaction between the gripper and the object, and it is these changes that relate to the hardness value.
Custom network design
Although 2D CNNs have proven to be highly effective in various image recognition tasks, they may not be the most desirable choice for hardness sensing. A key limitation is their inability to naturally handle temporal dependencies in a sequence of images, potentially leading to suboptimal performance. The solution is to stack a series of depth images into a 3D input volume and perform several 3D convolution operations to extract the spatial–temporal nuances. Using a 3D input volume (
For optimization, the adjustment for stride and padding is incorporated. The stride determines the step size between successive positions at which the convolution is applied, and the padding adds extra layers of zeros around the input volume to control the output dimensions.
To reduce the spatial dimensions, which are notably larger than the temporal dimension, we apply the max-pooling operation to the spatial dimensions only, preserving the temporal features while downsampling the spatial ones.
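The effect of kernel size, stride, and padding on the output dimensions can be checked with the standard output-size formula, as in the sketch below. The input shape of (4, 128, 128) and the (2, 3, 3) kernel are taken from values quoted elsewhere in this article; the padding and pooling-window values are hypothetical, and the helper names are not part of the original implementation.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one axis: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def conv3d_out_shape(in_shape, kernel, stride=(1, 1, 1), padding=(0, 0, 0)):
    """Output (depth, height, width) of a 3D convolution."""
    return tuple(conv_out(n, k, s, p)
                 for n, k, s, p in zip(in_shape, kernel, stride, padding))


def spatial_maxpool_out_shape(in_shape, window=2):
    """Max-pooling with a (1, window, window) kernel and equal stride.

    The temporal axis is left untouched; only height and width shrink.
    """
    depth, height, width = in_shape
    return (depth, height // window, width // window)


shape = conv3d_out_shape((4, 128, 128), kernel=(2, 3, 3))
print(shape)
print(conv3d_out_shape((4, 128, 128), kernel=(2, 3, 3), padding=(0, 1, 1)))
print(spatial_maxpool_out_shape(shape))
```

With only four frames along the temporal axis, pooling across that axis would quickly exhaust it, which is why restricting the pooling window to the spatial axes is a natural choice.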
Metrics
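We employ mean absolute percentage error (MAPE) as a metric to evaluate the accuracy of hardness sensing, and standard deviation (SD) to measure the variability or dispersion of the MAPE values. The MAPE takes its standard form:
\[ \mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \]
where $y_i$ is the ground-truth hardness of the $i$-th test sample, $\hat{y}_i$ is the corresponding estimate from the network, and $n$ is the number of test samples.
Given that we employ a 10-fold cross-validation, the overall MAPE is the mean of the MAPE values obtained over the 10 folds.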
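As an illustration of the metric, the following Python sketch computes the standard MAPE for one test set and averages it over cross-validation folds. The function names and the example hardness values are hypothetical, and the sketch assumes that all ground-truth values are nonzero, which holds for the 16 to 68 H00 range used here.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent, for one test set."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


def overall_mape(fold_mapes):
    """Mean of the per-fold MAPE values from k-fold cross-validation."""
    return sum(fold_mapes) / len(fold_mapes)


# Made-up ground-truth and predicted hardness values (H00).
print(mape([20.0, 50.0], [21.0, 49.0]))
print(overall_mape([1.0, 3.0]))
```

Because each error is divided by its ground-truth value, an absolute error of 1 H00 weighs more heavily for a soft object than for a hard one.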
Investigation of GripDepthSense3DNet
We commence network development with a foundational architecture (Group A in Fig. 6a). Reducing the temporal dimension before the interpretation stage aims to distill the most essential patterns, forming a condensed representation. During the interpretation stage, the network comprehends complex features, leading to an output reflecting the estimated hardness level based on learned features.

In each feature extraction block, the 3D convolution (Conv3D) layer is adept at capturing complex patterns within the depth image series. The function of max-pooling in our network is to reduce only the spatial dimension of the feature maps, retaining the most salient information while discarding less relevant details. The parameters used are detailed in Table 2.
Table 2. GripDepthSense3DNet (GDS3DNet) Variants
Conv3D and MaxPool3D are layers of the networks. The developmental networks are categorized into Groups A, B, and C, depending on their designs, with each network assigned a unique identifier.
This table summarizes the diverse parameters, including variations in filter size (f), kernel size (k), stride (s), padding (p), and number of homogeneous convolutional blocks (N), employed across the studied GripDepthSense3DNet architectures. For instance, a kernel size of (2,3,3) in the Conv3D layer indicates a kernel size of 2 in the temporal dimension and (3,3) in the spatial dimensions. The naming convention adopted is GripDepthSense3DNet-x[sy][kz], where x denotes the total number of convolutional layers in the network, y represents the stride, and z signifies the kernel size. Note that the last two variables in the identifier (y and z) are applicable only if they distinguish one variant from another within the group, and the values of y and z do not denote the strides and kernel sizes for all convolutional layers; instead, they specifically refer to the differentiating factors. The filter sizes come in powers of 2, that is, 32, 64, 128. This practice aligns with a common approach in deep learning to leverage computational advantages while maintaining model performance.
In Figure 6a, Group B represents an improvement over Group A by incorporating additional convolutional layers to enhance the feature extraction process. Multiple configurations are explored, including the introduction of homogeneous intermediate feature extraction blocks before the final block. Emphasis is placed on adjusting the stride to expedite training time while maintaining optimal performance. Beyond the GDS3DNet-4s4 variant, the kernel size of the second convolutional layer is modified to (3, 3, 3) to improve the model’s ability to capture spatial–temporal dynamics in the depth image series, with the first dimension of the kernel representing the temporal axis, that is, capturing changes across different frames. Although this modification slightly increased training time, its effectiveness has been empirically validated through an ablation experiment, which demonstrated significant reductions in MAPE values in subsequent iterations. Without this adjustment, later iterations would have been constrained by the limited 3D feature extraction capability of the original kernel configuration. Additionally, incorporating padding in the first convolutional layer becomes essential to maintain the network’s integrity, preventing a drastic reduction in the output feature map size during the initial stages where spatial and temporal nuances might not have been adequately extracted.
In Group C (see Fig. 6a), the first two feature extraction blocks serve to capture foundational features and reduce spatial dimensions, whereas the subsequent convolutional layers progressively extract more complex patterns and high-level representations. The homogeneous feature extraction blocks are crafted to exclusively comprise Conv3D and rectified linear unit (ReLU) layers, considering that feature maps are inherently compact during this segment of the network. Introducing multiple max-pooling operations at this segment could render the convolutional layers inoperative. This design consideration aligns with the approach in the popular AlexNet 43 architecture, which also avoids the use of max-pooling layers in its intermediate sections.
In all variants, we progressively increase the number of filters, from the first feature extraction block to the last. As the number of filters grows, the network gains the ability to discern a broader range of features extracted. It also allows the model to learn hierarchical representations, capturing both low and high-level features that contribute to the overall understanding of the data.
GDS3DNet-7 (Fig. 6b) is selected as the optimal network, boasting the lowest overall MAPE and satisfactory training time (Table 3). The addition of convolutional layers in GDS3DNet-8 and GDS3DNet-9 did not yield improvements in either MAPE or SD, despite the heightened complexity and increased computational cost, suggesting that the introduction of additional layers would likely lead to overfitting.
Table 3. Performance Metrics for GripDepthSense3DNet Variants
Key considerations during model development encompass the number of parameters, training time, overall MAPE (mean value computed from the 10-fold cross-validation), and standard deviation.
MAPE, mean absolute percentage error; SD, standard deviation.
A 10-fold cross-validation process ensures robust evaluation. However, the training time is recorded for a single iteration, mirroring real-world scenarios where training is typically performed once. Training is done using an NVIDIA GeForce GTX 960M graphics processing unit with 8.0 GB memory. On average, the testing time for each hardness sensing instance using GDS3DNet-7 is 102 milliseconds, provided that the required images and dependency modules are preloaded. The sensing process can be integrated into pick-and-place operations. For reference, the gripping time alone for a commercially available food-grade soft gripper 44 is 0.32 s, whereas a soft gripper discussed in the literature 45 shows a range of 0.34–0.70 s.
Experiments
For consistency across all training sessions, a standardized protocol is employed, involving 400 epochs, a batch size of 10, and the Adam optimizer with a learning rate set at 0.001. The utilization of a small learning rate is imperative to facilitate the model’s gradual convergence toward the minimum loss. The choice of 400 epochs ensures that the network is sufficiently trained. The dataset undergoes randomization and is subsequently partitioned into mutually exclusive sets for training, validation, and testing. Generally, 10% of the data is reserved for the test set, whereas the remaining 90% is further split into 80% for training and 20% for validation. The model parameters will be saved only if there is an improvement in terms of validation loss. Dynamic tuning experiments employ different splitting percentages, which will be detailed in the corresponding subsections.
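The data partitioning and checkpointing rules above can be sketched as follows. The function names, the fixed seed, and the use of rounding for the set sizes are assumptions for illustration; only the percentages and the save-on-improvement rule come from the text.

```python
import random


def split_dataset(items, test_frac=0.10, val_frac_of_rest=0.20, seed=0):
    """Shuffle, hold out a test set, then split the rest into train/validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_val = round(len(rest) * val_frac_of_rest)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


def checkpoint_epochs(val_losses):
    """Epoch indices at which the model would be saved (strict improvement only)."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            saved.append(epoch)
    return saved


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
print(checkpoint_epochs([0.9, 0.8, 0.85, 0.7]))
```

Relative to the full dataset, this corresponds to 72% for training, 18% for validation, and 10% for testing.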
Performance of GripDepthSense3DNet on training objects
In the initial phase, we select the optimal GDS3DNet-7, and proceed to train it using the dataset comprising the 24 training objects shown in Figure 3a. At this juncture, the depth interval remains fixed at 5 mm. The input volume format is shown in Figure 7a. The number of depth levels is configured to four to ensure a fair comparison with state-of-the-art networks, as four subimages of 128 × 128 pixels can precisely tile a square 2D image with no unutilized pixels (Fig. 7c). The initial experiment explores four different depth ranges: 10–25 mm, 15–30 mm, 20–35 mm, and 25–40 mm.

Fig. 7. Input volume/image formed using the depth image array.
The next phase of optimization involves adjusting the depth interval (5, 10, and 15 mm) and the number of depth levels (ranging from three to six). The input volume format is illustrated in Figure 7b.
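The depth levels implied by a given configuration can be enumerated with a small helper, sketched below. The function name and the choice to parameterize by the final depth are assumptions; the example values correspond to the depth ranges and intervals quoted in this article, and the 0–40 mm limit reflects the collected depth range.

```python
def depth_levels(last_mm, interval_mm, count, min_mm=0):
    """Return `count` depth levels spaced by `interval_mm`, ending at `last_mm`.

    Raises ValueError if the first level would fall below the shallowest
    collected depth (0 mm in the dataset described here).
    """
    first = last_mm - interval_mm * (count - 1)
    if first < min_mm:
        raise ValueError("configuration does not fit in the collected depth range")
    return [first + i * interval_mm for i in range(count)]


print(depth_levels(25, 5, 4))    # the 10-25 mm range
print(depth_levels(40, 5, 4))    # the 25-40 mm range
print(depth_levels(40, 15, 3))   # three levels at a 15 mm interval
```

Not every combination of interval and number of levels fits within the 0–40 mm range, so some combinations are ruled out by the available data alone.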
Benchmarking with state-of-the-art networks
The performance of GDS3DNet-7, with the optimized depth range, is benchmarked against AlexNet, 43 DenseNet-121, 46 Inception-v3, 47 ResNet-18, ResNet-34, ResNet-50, ResNeXt-50, 48 and ViT-T/16. 49 The inclusion of different ResNet 50 variants aims to assess the influence of network depth on hardness sensing accuracy. AlexNet is considered for its distinctive architecture, providing insights into its performance relative to more contemporary network architectures.
As these networks operate on 2D convolution, the depth images are concatenated to form a unified input image (Fig. 7c). This maintains consistency with GDS3DNet-7, as the number of input neurons remains the same at 65,536, derived from 128 × 128 × 4 voxels. The number of parameters for GDS3DNet-7 is 1,228,321, whereas the parameter counts for the compared networks are provided in the results section “Benchmarking with state-of-the-art networks.”
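The equivalence between the stacked volume and the tiled image can be illustrated with a short sketch. The row-major tile order used here is an assumption, since the actual arrangement is defined by Figure 7c, and the helper name is hypothetical.

```python
def tile_2x2(frames):
    """Tile four equally sized square frames into one 2D image (row-major order)."""
    if len(frames) != 4:
        raise ValueError("exactly four frames are required")
    top = [left + right for left, right in zip(frames[0], frames[1])]
    bottom = [left + right for left, right in zip(frames[2], frames[3])]
    return top + bottom


# Four 128 x 128 frames, each filled with its own index for easy checking.
frames = [[[k] * 128 for _ in range(128)] for k in range(4)]
tiled = tile_2x2(frames)
n_volume = 4 * 128 * 128
n_tiled = sum(len(row) for row in tiled)
print(len(tiled), len(tiled[0]), n_volume, n_tiled)
```

Both representations contain the same 65,536 values; the difference is that the tiled image places the depth levels side by side in the image plane rather than along a separate temporal axis.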
Dynamic tuning for untrained shapes/hardness
For experiments related to dynamic tuning for novel shapes or hardness, each iteration involves excluding images of one specific shape or hardness from training. Upon completion of the initial training, the layers within the feature extraction stage are frozen. Once the feature extraction stage is sufficiently trained to extract intricate features related to the hardness attribute, novel elements can be integrated merely by updating the weights and biases of the fully connected layers using a small set of images. For this purpose, 90% of the excluded images are designated as the test set, with 5% assigned to the validation set and another 5% to the training set for dynamic tuning.
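The principle of freezing the feature extraction stage and updating only the final layers can be shown with a deliberately small toy model. This is not the GDS3DNet architecture: the "feature extractor" is a fixed linear map standing in for the frozen convolutional stage, the head is a single linear layer trained by plain gradient descent, and all names, weights, and sample values are hypothetical.

```python
def frozen_features(x, frozen_weights):
    """Stand-in for the frozen feature-extraction stage; weights never change."""
    return [w * x for w in frozen_weights]


def head(features, head_w, head_b):
    """Trainable output layer producing a single regression value."""
    return sum(w * f for w, f in zip(head_w, features)) + head_b


def mse(samples, frozen_weights, head_w, head_b):
    errors = [head(frozen_features(x, frozen_weights), head_w, head_b) - y
              for x, y in samples]
    return sum(e * e for e in errors) / len(errors)


def tune_head(samples, frozen_weights, head_w, head_b, lr=0.01, steps=200):
    """Update only the head parameters using a small tuning set."""
    head_w = list(head_w)
    n = len(samples)
    for _ in range(steps):
        grad_w = [0.0] * len(head_w)
        grad_b = 0.0
        for x, y in samples:
            feats = frozen_features(x, frozen_weights)
            err = head(feats, head_w, head_b) - y
            for j, f in enumerate(feats):
                grad_w[j] += 2 * err * f / n
            grad_b += 2 * err / n
        head_w = [w - lr * g for w, g in zip(head_w, grad_w)]
        head_b -= lr * grad_b
    return head_w, head_b


frozen = (0.5, -0.25)                       # fixed after initial training
tuning_set = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
before = mse(tuning_set, frozen, [0.0, 0.0], 0.0)
w, b = tune_head(tuning_set, frozen, [0.0, 0.0], 0.0)
after = mse(tuning_set, frozen, w, b)
print(before > after, frozen)
```

Since only the head parameters are updated, the number of trainable values is small, which is consistent with tuning on just 5% of the excluded images.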
Dynamic tuning for novel objects
Similarly, when introducing complex-shaped objects or everyday objects with untrained shapes and hardness into the network, dynamic tuning can be achieved by utilizing a small set of the new images. This method obviates the need for retraining the entire network, allowing for targeted tuning tailored to applications with specific objects.
Experiment Results and Analysis
The performance and optimization of GripDepthSense3DNet are showcased, along with a comparative analysis against state-of-the-art networks. Following this, dynamic tuning results are detailed.
Performance of GripDepthSense3DNet on training objects
In Table 4, no improvement is noted when comparing the depth ranges of 10–25 mm and 15–30 mm, suggesting that the object deformation within these ranges is not substantial enough for effective recognition of hardness variations.
Table 4. Evaluation of Hardness Sensing Using GDS3DNet-7 Under Varying Depth Ranges
For the sake of consistent comparisons, the depth interval is fixed at 5 mm, and the number of depth levels is maintained at four.
A notable improvement is witnessed when transitioning to 20–35 mm, resulting in a remarkable 62% decrease in MAPE. This trend of improvement continues when shifting the depth range to 25–40 mm, yielding the lowest MAPE of 0.46%. Likewise, the SD for this range is the lowest among all cases at 0.79, indicating enhanced stability and consistency. Greater depths are preferred likely due to the increased deformation compared to shallower depths, which enhances the network’s ability to discern and identify hardness variations.
Figure 8a shows the ground-truth hardness values of different objects. With a depth range of 25–40 mm, Figure 8b demonstrates the model’s output values closely mirroring the actual hardness values, indicating a strong correlation between the model’s predictions and the ground-truth data. Figure 8c and d present the breakdown of results based on shapes and hardness. The analysis reveals that the MAPE is highest for the sphere, followed by the hexagon, cube, and cylinder. The elevated MAPE for the sphere can be attributed to its curved nature, aligning with the curvature of the membrane when the fingers close, yielding significantly less deformation. Concerning hardness, a consistent and descending MAPE trend is noted from Hardness I to Hardness VI. This reiterates that increased deformation enhances hardness sensing capabilities, as softer objects exhibit more pronounced object deformation during grasping.

From Table 5, although the lowest MAPE is achieved using three depths with a 15 mm interval, we opt for the configuration of four depths with a 5 mm interval for the ensuing experiments. One reason is that it yields the lowest SD value while maintaining a relatively low MAPE value, emphasizing a trade-off between optimizing accuracy and minimizing variance. Another rationale for maintaining four depths is to ensure consistency in input data size for benchmarking against state-of-the-art networks.
Table 5. Configurations for Network Optimizations, Along with the Corresponding Results in Terms of Overall MAPE and Standard Deviation
Building on the knowledge gained earlier that greater depths are preferred, the depth levels are selected such that 40 mm is the final depth.
Although the aforementioned results were obtained using a 10-finger gripper, the robustness of GripDepthSense3DNet has been validated by executing hardness sensing with a 5-finger gripper, using the optimal depth configuration of 4 depths and 5 mm interval. The resulting MAPE is found to be 0.36%, indicating that the network could be trained with a different gripper configuration (see Supplementary Data S2).
Benchmarking with state-of-the-art networks
In Figure 9a, GDS3DNet-7 demonstrates accelerated convergence to the minima when compared to alternative architectures. Upon reaching a steady state, the error values consistently maintain a lower profile than those of other models. Referring to Figure 9b and Table 6, GDS3DNet-7 has the lowest MAPE and SD values, as well as the shortest training time. A higher number of parameters does not guarantee improved performance, such as in the case of AlexNet, as this abundance of parameters may lead to overfitting and inefficiencies in learning the hardness-related features. Compared to ResNet-50, which achieves a MAPE of 0.60% as the best-performing state-of-the-art network in the comparison, GDS3DNet-7 delivers an impressive 94.8% reduction in parameters while reducing training time by approximately 92.9% on equivalent hardware. Figure 10 presents a detailed breakdown of results based on object shapes and hardness levels, for several comparable networks.


Fig. 10. Breakdown of several comparable state-of-the-art networks’ results according to MAPE and SD, where “M” denotes the mean values. For conciseness, only the best-performing ResNet variant is shown, and the remaining networks for this figure are chosen based on their performance. Results for GDS3DNet-7 have been shown earlier.
Performances of GripDepthSense3DNet-7 and State-of-the-Art Networks
The ResNet variant exhibiting the closest accuracy to GDS3DNet-7 is ResNet-50, albeit still having a higher overall MAPE.
For DenseNet-121 and ResNeXt-50, the training times were not recorded as the training could not be completed on the same local computer due to memory limitations. Consequently, the MAPE and SD values were obtained using a high-performance computing (HPC) platform with 64 GB of memory. As these two networks perform more poorly overall, the training time is not considered an essential marker for them.
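The MAPE and SD figures used in this comparison can be illustrated with a minimal sketch. It assumes that SD refers to the standard deviation of the per-sample absolute percentage errors (the population form is used here, which is an assumption) and that the hardness values are on the Shore-00 scale. The function names and the sample values are hypothetical and are not taken from the experiments.

```python
import statistics


def absolute_percentage_errors(true_values, predicted_values):
    """Per-sample absolute percentage error, in percent."""
    return [
        abs(t - p) / abs(t) * 100.0
        for t, p in zip(true_values, predicted_values)
    ]


def mape_and_sd(true_values, predicted_values):
    """Return (MAPE, SD) of the absolute percentage errors.

    Assumption: SD is the population standard deviation of the
    per-sample errors; the source does not state which form is used.
    """
    errors = absolute_percentage_errors(true_values, predicted_values)
    return statistics.mean(errors), statistics.pstdev(errors)


# Hypothetical Shore-00 hardness labels and network predictions.
true_hardness = [20.0, 40.0, 50.0]
predicted_hardness = [21.0, 38.0, 50.0]

mape, sd = mape_and_sd(true_hardness, predicted_hardness)
print(f"MAPE = {mape:.2f}%, SD = {sd:.2f}%")
```

Reporting both values together, as in the tables above, distinguishes a network that is accurate on average from one that is also consistent across samples.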
Dynamic tuning for untrained shapes/hardness
When untrained hardness levels are incorporated, the resulting mean MAPE is 1.17% (Table 7), with no substantial difference in MAPE values across the various hardness levels. For untrained shapes, the observation in Table 7 aligns with the previous findings, where the spherical shape yields the highest MAPE and SD. For the other shapes, the MAPE values remain acceptable, ranging from 0.54% to 1.85%.
Breakdown of GDS3DNet-7 Hardness Sensing Results According to Shape/Hardness
For each of the labeled shape/hardness columns, the shape/hardness studied has been excluded during the respective initial training.
Dynamic tuning for novel objects
All 16 complex objects were collectively assimilated into the network. By using a training/validation percentage of 10%, the mean MAPE is 6.86%, as shown in Table 8.
Breakdown of Dynamic Tuning Results for Novel Objects
For each complex shape, there are four identical objects with different hardness.
The Bear shape demonstrates a clear trend of increasing MAPE as the hardness increases, with values ranging from 1.96% for Bear-A (softest) to 9.23% for Bear-D (hardest), and a mean MAPE of 4.90%. This trend aligns with earlier findings, where the network tends to perform well for softer objects. However, the error increases for Bear-C and Bear-D, likely because harder objects exhibit less deformation.
In contrast, the Rose, Shell, and Snail shapes do not follow the same trend of increasing MAPE with increasing hardness. Although the network predicts their hardness with reasonable accuracy, their MAPE values vary significantly across hardness levels, suggesting that the complexity of their geometry introduces challenges for the network. The SD values reflect this as well: the Bear shape has the lowest SD (4.24%), whereas the Shell and Snail shapes show higher SDs (6.68% and 6.37%, respectively).
A major contributing factor to this variability could be the presence of concave portions in the geometries of these shapes, which differ significantly from the objects seen in the training set. These features introduce nonlinearities and complex interactions between the material and its environment, which the network may not recognize. Furthermore, the dynamic tuning process only fine-tunes the network, without updating the weights in the feature extraction stage. This limits the model’s capacity to adjust to the new shape complexities, resulting in higher errors for the complex shapes.
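The idea of tuning with a frozen feature extraction stage can be sketched with a toy model. This is not the GripDepthSense3DNet architecture: the two-unit feature stage, the linear regression head, the learning rate, and the sample data are all hypothetical and serve only to show that the feature weights stay fixed while the head adapts to new samples.

```python
def extract_features(x, frozen_weights):
    """Feature stage: fixed linear projection followed by ReLU."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)))
        for row in frozen_weights
    ]


def predict(x, frozen_weights, head):
    """Regression head applied to the frozen features."""
    feats = extract_features(x, frozen_weights)
    return sum(h * f for h, f in zip(head["w"], feats)) + head["b"]


def mean_squared_error(samples, frozen_weights, head):
    return sum(
        (predict(x, frozen_weights, head) - y) ** 2 for x, y in samples
    ) / len(samples)


def tune_head(samples, frozen_weights, head, lr=0.01, epochs=200):
    """Stochastic gradient descent that updates only the head.

    frozen_weights is read but never modified, mirroring a tuning
    step that leaves the feature extraction stage untouched.
    """
    for _ in range(epochs):
        for x, y in samples:
            feats = extract_features(x, frozen_weights)
            err = predict(x, frozen_weights, head) - y
            head["w"] = [h - lr * err * f for h, f in zip(head["w"], feats)]
            head["b"] -= lr * err
    return head


# Hypothetical toy data: 2-D inputs with hardness-like targets.
frozen_weights = [[0.5, -0.2], [0.1, 0.3]]
new_samples = [([1.0, 2.0], 30.0), ([2.0, 1.0], 50.0), ([3.0, 3.0], 70.0)]
head = {"w": [0.0, 0.0], "b": 0.0}

weights_before = [row[:] for row in frozen_weights]
loss_before = mean_squared_error(new_samples, frozen_weights, head)
tune_head(new_samples, frozen_weights, head)
loss_after = mean_squared_error(new_samples, frozen_weights, head)
print(f"loss before: {loss_before:.1f}, loss after: {loss_after:.1f}")
```

Because the features are fixed, the head can only recombine what the feature stage already encodes. This is consistent with the explanation above that geometries unlike those in the training set are harder to accommodate.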
A larger dataset for dynamic tuning would contribute to better performance by exposing the network to a more diverse range of examples. This has been validated through an experiment using different percentages for dynamic tuning (see Supplementary Data S3). With every incremental increase in percentage, there is a notable decrease in the MAPE value, indicating that the network can better capture underlying patterns and relationships. Conversely, a smaller dataset may lead to overfitting and reduced generalization.
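A sweep over tuning percentages can be sketched as follows. The sketch assumes that the percentage is the share of the new-object samples used for tuning, with the remainder held out for evaluation. The helper name `split_for_tuning`, the seed, and the dataset size are hypothetical.

```python
import random


def split_for_tuning(samples, tuning_fraction, seed=0):
    """Split samples into a tuning subset and a held-out subset.

    tuning_fraction is the share of samples used for dynamic tuning
    (for example 0.10 for 10%). The shuffle is seeded so that the
    split is reproducible.
    """
    if not 0.0 < tuning_fraction < 1.0:
        raise ValueError("tuning_fraction must be between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_tune = max(1, round(len(shuffled) * tuning_fraction))
    return shuffled[:n_tune], shuffled[n_tune:]


# Hypothetical dataset of 100 sample indices for the new objects.
dataset = list(range(100))
for fraction in (0.10, 0.20, 0.30):
    tune, held_out = split_for_tuning(dataset, fraction)
    print(f"{fraction:.0%} tuning: {len(tune)} tuning, {len(held_out)} held out")
```

Keeping the held-out subset disjoint from the tuning subset means that an error reported at each percentage reflects samples the tuned network has not seen.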
For the everyday objects, the model achieved better performance compared to the complex objects. Their simpler geometries, free from intricate or concave features, allowed the network to process them more effectively. Certain everyday objects, such as the kiwi and stress ball, shared strong similarities in shape with the training objects.
Conclusion
We proposed and optimized GripDepthSense3DNet, a novel network designed for hardness sensing during object grasping from a series of 3D depth images. To investigate and optimize GripDepthSense3DNet, we used a modified version of the widely used jamming-based soft gripper which, instead of a jamming material, integrates flexible fingers and a marker-less deformable contact membrane. The optimal depth range of 25–40 mm in the four-depth scenario yielded a low MAPE of 0.46%. Benchmarking against state-of-the-art networks highlighted the efficiency of GripDepthSense3DNet, with superior accuracy, fewer trainable parameters, and shorter training times. Enhanced stability is evident from the SD of 0.79, which is the lowest among all networks tested.
Dynamic tuning experiments demonstrated the network’s adaptability, successfully integrating untrained shapes and hardness levels with commendable results. Notably, the network exhibited consistent performance across various untrained hardnesses, leading to a mean MAPE of 1.17%. Introducing complex objects demonstrated promising adaptability, with an overall MAPE of 6.86%. However, more complex shapes, such as the Shell and Snail, tended to introduce higher errors due to the embedded intricacies.
Although our results are promising, our approach has some limitations. Currently, it works only with objects within the Shore-00 hardness scale, because the gripper cannot effectively produce and capture useful deformation outside this hardness range. Additionally, the gripper cannot securely hold flat objects, as the design requires objects with some thickness to enable stable grasping. Objects must also fit within the contact surface of the gripper. If an object exceeds these boundaries, it cannot be properly enclosed, resulting in insufficient deformation for effective depth image capture. Finally, the depth resolution of the captured images is limited by their fixed 8-bit pixel intensity range.
For future work, we aim to develop advanced gripper designs capable of producing useful deformation beyond the Shore-00 scale. Additionally, we plan to explore alternative grasping methodologies to overcome the current limitations on object types and enhance versatility. We will also explore the potential across broader use cases, such as fruit processing (e.g., grade-based sorting) and food processing (e.g., quality assurance). The framework could also be extended with an additional image processing algorithm that alerts the user when elasticity degradation of the gripper is detected.
Footnotes
Authors’ Contributions
T.R.L.: Developed the system, performed and analyzed the experiments, and prepared the article. B.L.J.S.: Performed and analyzed the experiments. C.P.T.: Supervised the project. S.G.N.: Conceived and supervised the project. M.A.J.: Conceived and supervised the project.
Author Disclosure Statement
No conflicts of interest to disclose.
Funding Information
This work was supported by the Fundamental Research Grant Scheme (FRGS) (Grant No. FRGS/1/2023/TK10/MUSM/02/1) provided by the Ministry of Higher Education Malaysia, and the Graduate Research Merit Scholarship from the School of Engineering, Monash University Malaysia.
