GripDepthSense3DNet: A Depth-Enabled Hardness Sensing Framework in Soft Robotic Grasping

Abstract

Despite the development of numerous soft grippers designed to handle deformable objects, hardness sensing remains a challenge, yet it is essential for applications such as product selection or sorting, fruit ripeness assessment, and food quality control. This research introduces GripDepthSense3DNet, an approach that integrates 3D depth sensing with machine learning for accurate hardness sensing during grasping. Using a dataset comprising depth images of diverse objects undergoing deformation, the proposed network is trained to capture spatial–temporal deformation features from a series of depth images. GripDepthSense3DNet outperforms state-of-the-art networks, achieving a mean absolute percentage error of 0.46% for trained shapes and hardness levels. Specifically, the model has approximately 94.8% fewer parameters than ResNet-50, with a training time that is around 92.9% shorter on equivalent hardware. Different depth ranges and intervals were studied to arrive at an optimal configuration. Through dynamic tuning, the network can incorporate new shapes, new hardness levels, and even intricate arbitrary objects, which highlights the adaptability of the approach.

Introduction

Soft robotic systems are increasingly being deployed in real-world applications that demand adaptability in interacting with the environment. Achieving such capabilities requires the integration of advanced perception and manipulation techniques.

Recently, there has been a significant development of soft grippers designed to handle delicate and deformable objects. The ability to grasp and manipulate such objects is crucial, as many real-world scenarios involve items with varying degrees of deformability. Effective grasping approaches for soft grippers are commonly categorized into gripping by actuation, controlled stiffness, or controlled adhesion. Gripping by actuation can be achieved through contact-driven technology,¹ fluidic-elastomer-actuators,^2–7 or electroactive-polymers.⁸ Grippers employing controlled stiffness^9,10 are initially put in a soft configuration before the stiffening mechanism is activated to achieve object grasping. As for controlled adhesion, the two major adhesion technologies are electro-adhesion and gecko-adhesion.¹¹

Sensing plays a crucial role in the effectiveness of soft grippers, as it provides the basis for informed decision-making. It can be achieved through pure tactile sensing,^12–14 visual sensing using an encased 2D or 3D camera,^15–24 or force-based sensing.^25–27 Multimodal sensing is possible by combining two or more sensing techniques.²⁸ Visual sensing may yield greater spatial resolution, but achieving sensitivity to tactile interactions comparable to that of traditional tactile sensing poses a challenge. In addition, tracking markers may be required,^18–24 specifically when depth information is not readily available.

Object rigidity is typically assumed when object deformability poses no impediment to the task. If it becomes a significant consideration, the integration of robust capabilities for sensing deformable objects becomes imperative. Within the realm of deformation sensing, hardness/stiffness sensing^{18,19,29–32} finds practical applications in product selection or sorting, assessing fruit ripeness,^33,34 and ensuring material or food quality control.

Machine learning approaches can be applied to sensing applications such as object recognition,^14,35 deformation tracking,^36,37 and hardness/stiffness sensing.^19,26 Effective sensing may be achieved via a convolutional neural network (CNN) such as the popular AlexNet,²² ResNet series,¹⁶ VGG16,¹⁹ or even a custom one.¹⁴

Table 1 presents a collection of sensing-related studies^{15,16,19–22,25,32,38–41} employing a soft contact material. 3D CNN architectures have been employed for tactile object recognition,^38,39 whereas time-series data has shown significant utility in two-class stiffness classification.⁴⁰ However, with the exception of two studies,^19,32 the rest do not focus on single-label hardness estimation. Notably, sensors such as GelSight¹⁹ require markers on the membrane. For the soft durometer,³² compressive deflection of the embedded magnetic probe is measured. Among the vision-based studies, only two have leveraged a 3D camera,^15,16 which eliminates the need for markers on the membrane.

Table 1.

Summary of Literature Survey on Selected Object-Sensing Studies Utilizing Soft Contact Materials, Which Undergo Deformation upon Contact with Objects, Thereby Revealing Valuable Insights into the Object’s Properties

Sensor model/end effector	Function(s)	Grasping	Camera/sensing element	Contact material	Tracking markers	Deep learning	Objects tested
GripDepthSense3DNet (proposed work)	Shape-invariant hardness estimation, with option to integrate novel complex objects through dynamic tuning for better performance	Pneumatically actuated flexible fingers	3D ToF camera: CamBoard pico flexx	Black thin rubber sheet from a balloon	No	GripDepthSense3DNet	24 training silicone objects: cubes, cylinders, hexagons, spheres; 16 novel silicone objects: rose, bear, seashell, and snail; 5 everyday objects
Active Perception Gripper³⁸	Object recognition	A gripper with two underactuated fingers and one fixed thumb	Tactile sensors	Silicone pad	N/A	3D TactNet	Rigid objects, deformable objects, and in-bag objects
BioTac SP and WTS-FT³⁹	Object recognition	AR-10 humanoid robotic hand	Tactile sensors	Elastomeric skin/taxels on the contact surface (vary according to the sensor used)	N/A	2D-CNN, 3D-CNN, LSTM	Spiky rubber ball, plastic ball, water bottle, metal pipe, cardboard box, sponge, roller, triangular prism, icosahedron
CySkin with Varying Soft Morphologies⁴⁰	Object classification based on geometry, surface texture, and stiffness	Not mentioned	Capacitive tactile sensors	3D-printed dielectric elastomer layers with varying morphologies	N/A	No	Eight objects presenting three main surface texture differences (geometric, texture, and elasticity)
FingerVision²¹	Proximity vision and force estimation	Can be incorporated into a parallel gripper or a Robotiq 2-finger adaptive robot	2D camera with a fisheye lens: model not specified	Casting silicone: Silicones Inc. XP-565	Yes (for force estimation)	No	Screwdriver, Coke can, business card, stuffed toy, marshmallow, pen, origami bird
GelSight Tactile Sensor¹⁹	Shape-independent hardness estimation	Possible to be attached to a rigid gripper	2D camera: model not specified	Soft elastomer	Yes	VGG16 + LSTM	Silicone objects of basic shapes, complicated shapes, and natural objects
GelSight Tactile Sensor²²	Coin recognition, slip, and shear visualization	Not mentioned	Prototype 1: Logitech C310, 5 MP resolution Prototype 2: Logitech C270, 3MP resolution	Peeled silicone sponge marked with UV dots, painted with a reflective coating	UV markers in Prototype 2.	AlexNet, optical flow algorithm	Five coin samples with different faces
Soft Bubble¹⁶	Object classification, pose estimation and tracking, and nonprehensile object manipulation	May be able to be incorporated into a parallel gripper. Research ongoing	3D ToF camera: CamBoard pico flexx	Inflated 0.4 mm thick latex sheet	No	ResNet18	Hard objects: cube, robot block, frustum, triangular prism, bridge, hemisphere
Soft Durometer³²	Hardness estimation	Can be mounted to a soft gripper	Custom sensor consisting of a force transducer, hall effect transducer, and a cubic magnet	Silicone, with an embedded magnetic probe	N/A	No	12 different targets from SMOOTH-ON and test block kit from VTSYIQI
Soft Fingertip¹⁵	Contact region estimation and perception-action coupling	Not mentioned	3D ToF camera: CamBoard pico flexx, and a load cell	2 mm silicone sheet of hardness approximately Shore 00–50	No	No	Sphere, cuboid, and ring
TacTip Family⁴¹	Object localization	Can be integrated into 3D-printed robot grippers	2D camera: Adafruit spy camera with a fisheye lens (for TacTip-GR2 model)	Tango Black+	Yes (pin tracking)	No	Cylinder with diameter of 25 mm
“Universal” Gripper²⁰	Deformation tracking	Granular-jamming-based with transparent liquid or solid filling	2D camera with a 180-degree fisheye lens: model not specified	1 mm semitransparent elastomer of hardness ≤ ShoreA-20	Yes	No	Cylinders of different length and diameter, and cuboids of different length
“Universal” Gripper²⁵	Determine location, posture, and shape of the object	Granular-jamming-based with ground coffee	Embedded conductive thermoplastic elastomer (CTPE) strain sensors	Silicone sheet	N/A	No	Pyramid, cylinder, cuboid, and hemisphere

The GelSight sensor in Yuan et al.¹⁹ has a function similar to the proposed work, but object grasping is not shown, markers on the membrane are required, and the network (VGG16 + LSTM) is comparatively complex. A 3D time-of-flight camera cannot be used in Sakuma et al.²⁰ because of the filling inside the gripper. In Hughes et al.,²⁵ spatial resolution is limited, and sensing is restricted to hard objects. Compared with a soft durometer³² with a force sensor, which may only provide hardness measurements at specific contact points, vision-based sensing delivers information about the entire object surface, enhancing the understanding of its hardness distribution.

In contrast, our work presents a vision-based approach for estimating the hardness of objects using a custom 3D CNN architecture designed to efficiently process depth images obtained from a single depth camera embedded inside a robotic gripper. Vision-based sensing decouples the sensing mechanism from the contact material, thereby eliminating the need for redesign or extensive recalibration of sensing elements across different applications and making it adaptable to various gripper designs. Compared to other sensing methodologies, it facilitates easier maintenance, as the sensing components experience minimal wear and tear. In addition, the proposed approach provides nondestructive, sensory-guided robotic perception: it senses object properties without causing any permanent physical alteration, damage, or wear to the objects being examined. This is achieved by fusing the domains of soft robotics and visual sensing, facilitating informed and adaptive interactions with the environment. Furthermore, it offers higher spatial resolution than tactile sensors.^38–40 If required, the captured image may serve additional purposes such as object classification.

The methodology will be detailed in the section “Materials and Methods.” Section “Experiment Results and Analysis” presents both the experimental results and an in-depth analysis of the findings. Section “Conclusion” concludes the work presented.

Materials and Methods

We present a novel hardness sensing framework designed for use during the grasping of soft and deformable objects by a soft robotic gripper featuring a deformable marker-less contact membrane (Fig. 1a). Integrated within the gripper is a depth camera, capturing the deformations of the membrane resulting from the interaction between the gripper and the grasped object. Depth images are systematically captured at various depth levels throughout the grasping sequence. Our custom neural network is designed to leverage the power of deep learning in processing spatial–temporal data, enabling a specialized approach to decipher complex deformation patterns and establish correlations with the hardness values of a diverse array of objects.

FIG. 1.

(a) Experimental setup. While the gripper securely grasps an object (in this case a soft “cube”), the corresponding depth image and membrane visualization are shown on the custom gripper software. The software communicates with the depth camera and the robotic manipulator, which then sends signals to control a solenoid valve which is linked to the pneumatic air supply and the vacuum generator. The gripper has a pneumatic connection to the vacuum generator. (b) Cross-sectional illustration of the gripper grasping a soft object (not drawn to scale). During the grasping process, the flexible fingers undergo closure and conform to the shape of the object. ToF camera denotes a depth camera leveraging time-of-flight (ToF) technology to measure distances; contact membrane refers to the thin rubber sheet that deforms upon contact with the object; “sensor reading” values indicate the distances between the camera and the object, whereas “min” and “max” are used to define the “range,” which extends from the membrane’s undeformed position to the maximum allowable membrane deformation, optimizing depth resolution.

The gripper draws inspiration from the particle jamming gripper,^9,10 a widely used soft gripper known for its capability to grasp objects of various geometries. However, we modified the design to exclude the granular material inside, enabling the embedded camera to observe membrane deformation directly. Instead, we use flexible fingers to securely hold the object. Our hardness sensing model outputs a single hardness value (regression) rather than simply assigning a hardness class (classification), which allows for more detailed hardness sensing during object manipulation.

Gripper design

The gripper (Fig. 2) adopts a modular design for flexibility, cost-effectiveness, and easy maintenance. Changes such as adjusting the number or length of fingers, or replacing the depth camera or the robotic manipulator, require modifying only the affected part.

FIG. 2.

Renderings of the custom soft robotic gripper design. (a) Components of the gripper. The flexible fingers are 3D printed using thermoplastic polyurethane (TPU95) with 20% infill density, ensuring resilience. Other gripper components, including the base plate, are 3D printed using polylactic acid (PLA). The base plate ensures a secure connection to the robotic manipulator, and the sensor holder snugly houses the depth camera. The pneumatic base includes an inlet for negative air pressure supply and accommodates attachment holes for the flexible fingers. (b) Dimensions of the gripper.

A thin and opaque marker-less membrane, cut from a large balloon, is affixed to the gripper’s exterior. This contact membrane, when pressed against an object, undergoes deformation. The tension on the membrane then compels the flexible fingers to close. The membrane deformations during object grasping contain valuable information on the object’s hardness.

To ensure the fingers and membrane maintain optimal elasticity, the gripper’s custom software overlays fiducials on the live depth camera feed, outlining the desired effective contact area. This allows any degradation in elasticity to be easily detected on the software interface through visible changes in the contact area.

Fabrication and collection of soft objects

For network development and evaluation, 24 soft objects with basic shapes, 16 with intricate shapes, and several everyday objects were included, as pictured in Figure 3. Except for the everyday objects, the remaining ones were manufactured using SMOOTH-ON Ecoflex, with six varying hardness levels for the training objects, achieved by adjusting the percentage of SMOOTH-ON Slacker added. The hardness values of the training objects span from approximately 16 to 68 H00 (see Supplementary Data S1 for the ground-truth hardness values). This study is constrained to homogeneous objects, wherein the entire object is constructed from the same material. Hardness assessment is conducted using a handheld durometer compliant with the ASTM D2240 standard, featuring an accuracy of ±1 H00. The measurements are averaged from five readings taken at various points on each object.

FIG. 3.

Fabricated soft objects, chosen to ensure diversity in shape and hardness. The choice of basic shapes (sphere, cylinder, hexagon, and cube) stems from their prevalence in benchmarking scenarios, with each shape being distinct from the others. To further challenge the network’s adaptability, more complex shapes were introduced during the dynamic tuning experiment, showcasing the network’s ability to learn and generalize to novel shapes. (a) Soft objects for training. Left to right: Hardness I to Hardness VI; refer to Supplementary Data S1 for the ground-truth hardness values. Front to back: spheres, cylinders, hexagons, and cubes. (b) Complex shapes: rose, bear, seashell, and snail (clockwise from top left). For each shape, four objects of identical geometry but different hardness were fabricated to enhance variability. (c) Everyday objects. Left to right: sapodilla, marshmallow, triangular prism cut from a sponge, kiwi, stress ball.

Fabricating intricate soft objects from silicone rubber yields identical-looking objects with varying hardness levels. This keeps the focus solely on the deformation of the contact membrane: color patterns are irrelevant, and the primary factor of interest is the objects’ differing hardness levels.

Robotic sequencing

The robot manipulator employed is the ABB-IRB-120 model. The programmed grasping sequence is shown in Figure 4. The closure of fingers is facilitated by the pneumatics subsystem, which includes a vacuum generator providing negative pressure to facilitate secure gripping.

FIG. 4.

Robotic grasping sequence with a rose-shaped soft object. (a) Initial position. (b) From the 0 mm position, that is, initial contact with the object, the gripper descends to the 40 mm position in a continuous motion. At each depth level, spaced 1 mm apart, the robot sends a trigger signal to the gripper software to capture a depth image. The resulting depth image is generated following the procedure described in the section “Generation of depth images.” (c) The gripper reaches the maximum depth of 40 mm, and the robot’s I/O port connected to the solenoid valve is activated, compelling the fingers to remain closed. Due to their flexibility, the fingers conform to the shape of the object, which enables effective grasping and prevents damage to delicate objects. (d) With the gripper in a grasping configuration, the robot ascends to a specified height, indicating a successful grasp.

The pneumatic air supply is first regulated to a constant pressure of 4 bar (approximately 58.02 psi). This air pressure remains fixed for all objects, as the flexible fingers are designed to adaptively conform to the shape of the objects, allowing for effective grasping without the need to adjust the air pressure, thereby eliminating the need for an additional feedback loop. The air supply is then linked to a solenoid valve operated by a digital signal from the robot, and the valve’s output connects to the vacuum generator, inducing negative air pressure for grasping. To release the object, the vacuum supply is deactivated, causing the air pressure inside the gripper to increase, which opens the fingers and releases the object.

For data collection, the robotic sequence resembles that of object grasping: the gripper descends from its initial position to the 40 mm position (in increments of 1 mm) while the pneumatic subsystem remains inactive. Due to the tension applied by the membrane upon contact with the object, the fingers are compelled to move and close inward. Following this, the robot ascends to a holding position (without the object, as grasping was not completed) and undergoes a 3° clockwise rotation. This cycle is repeated for orientations within the first quadrant, from 0° to 87° (30 orientations in 3° steps). There is no requirement to deliberately reposition the object at various locations on the platform. When the gripper makes contact with the object, it naturally aligns the object to a centralized position within the gripper. Nevertheless, the robot’s motion may inadvertently induce minor movements in the object during each cycle, introducing a certain level of variability into the collected dataset.

Data collection encompasses all objects in Figure 3. Collecting data for all 41 depths (0–40 mm) allows for the exploration of various factors, such as employing different numbers of depth levels, adjusting the depth range, and experimenting with different depth intervals.
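As an illustration of how depth subsets can be drawn from the full 41-level recording, the following Python sketch enumerates depth levels for a given range and interval. The function name `select_depth_levels` and the example ranges and intervals are hypothetical; the configurations actually studied are not specified in this section.

```python
def select_depth_levels(start_mm, stop_mm, interval_mm):
    """Return the depth levels (in mm) from start_mm to stop_mm inclusive.

    Hypothetical helper: the recording covers 0-40 mm in 1 mm steps,
    so any subset is defined by a range and an interval.
    """
    if not (0 <= start_mm <= stop_mm <= 40):
        raise ValueError("depth range must lie within 0-40 mm")
    if interval_mm < 1:
        raise ValueError("interval must be at least 1 mm")
    return list(range(start_mm, stop_mm + 1, interval_mm))


# The full recording: every millimetre from 0 to 40 mm.
all_levels = select_depth_levels(0, 40, 1)

# Illustrative subsets only (not the configurations reported in the study).
coarse_levels = select_depth_levels(0, 40, 5)
late_levels = select_depth_levels(20, 40, 2)

print(len(all_levels), coarse_levels, late_levels)
```

Because every 1 mm level is recorded once, such subsets can be evaluated offline without repeating the robotic data collection.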

Generation of depth images

The visual sensor used, shown at the top of Figure 1b, is the CamBoard pico flexx 3D time-of-flight (ToF) depth camera,⁴² which has a spatial resolution of 224 × 172 pixels, a depth resolution of ≤2% of distance, and a measurement range of 0.1–4 m. It has been successfully used in deformation-sensing applications.^15,16

The deformation of our gripper’s contact membrane during grasping falls within the camera’s measurement range. Using its ToF-based point cloud measurements, the depth information of the membrane can be captured without requiring markers. To achieve optimal representation of deformations in 8-bit grayscale depth images (with intensity values ranging from 0 to 255), our custom gripper software provides options to adjust threshold distances. This flexibility accommodates design changes in the gripper and ensures the depth resolution remains sensitive enough to capture fine details. A transformation process maps raw sensor readings into pixel intensity values according to a predefined depth range and direction, enhancing visualization accuracy (as depicted in Fig. 1b).
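A minimal sketch of such a transformation is given below, assuming a linear mapping with clipping to the configured range. The function name, the `near_is_bright` flag, and the threshold values are hypothetical; the actual mapping used by the gripper software is not detailed here.

```python
def depth_to_intensity(distance_m, range_min_m, range_max_m, near_is_bright=True):
    """Map a raw distance reading (metres) to an 8-bit intensity (0-255).

    Assumptions (not from the source): the mapping is linear, readings
    outside the range are clipped, and the direction is selectable.
    """
    if range_max_m <= range_min_m:
        raise ValueError("range_max_m must exceed range_min_m")
    # Clip the reading to the configured threshold distances.
    clipped = min(max(distance_m, range_min_m), range_max_m)
    # Normalise to 0..1 across the range.
    fraction = (clipped - range_min_m) / (range_max_m - range_min_m)
    if near_is_bright:
        fraction = 1.0 - fraction
    return round(fraction * 255)


# Hypothetical thresholds: a 50 mm window between 0.10 m and 0.15 m.
RANGE_MIN_M, RANGE_MAX_M = 0.10, 0.15
nearest = depth_to_intensity(0.10, RANGE_MIN_M, RANGE_MAX_M)
farthest = depth_to_intensity(0.15, RANGE_MIN_M, RANGE_MAX_M)
beyond = depth_to_intensity(0.50, RANGE_MIN_M, RANGE_MAX_M)
print(nearest, farthest, beyond)
```

Restricting the range to the interval the membrane can actually occupy is what keeps the 8-bit depth resolution fine: with the hypothetical 50 mm window above, one gray level corresponds to roughly 0.2 mm.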

After generating the depth image, a square region of interest of 150 × 150 pixels is cropped to center the focus on the object. Besides reducing the pixel count by about 42% (from 224 × 172 to 150 × 150), this step removes irrelevant portions of the original image, which could add complexity and potentially confuse the neural network if retained. Next, a median filter with a kernel size of 3, the smallest effective size, is applied to suppress noise from the camera. Any additional resizing needed for input to the hardness sensing network is handled in later stages.
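The two preprocessing steps can be sketched in plain Python as follows. This is an illustrative sketch, not the gripper software: the crop is assumed to be centered in the frame, border pixels are left unfiltered, and the function names are hypothetical.

```python
from statistics import median


def center_crop(image, size):
    """Crop a size x size window from the centre of a 2D image (list of rows).

    Assumption: the region of interest is centred in the frame.
    """
    rows, cols = len(image), len(image[0])
    top, left = (rows - size) // 2, (cols - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def median_filter_3x3(image):
    """Apply a 3 x 3 median filter; border pixels are copied unchanged."""
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = [image[r + dr][c + dc]
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            out[r][c] = median(window)
    return out


# Synthetic 224 x 172 frame (172 rows, 224 columns) with one noisy pixel.
frame = [[0] * 224 for _ in range(172)]
frame[86][112] = 255

roi = center_crop(frame, 150)
filtered = median_filter_3x3(roi)

# Share of pixels removed by the crop, in percent.
pixel_reduction = 100 * (1 - (150 * 150) / (224 * 172))
print(len(roi), len(roi[0]), round(pixel_reduction, 1))
```

A 3 × 3 median removes isolated outliers such as the synthetic pixel above while leaving larger structures largely intact, which is why it suits sparse sensor noise.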

Since the robotic motion only covers orientations in the first quadrant, images representative of the remaining quadrants are generated by digitally rotating the captured images. This approach significantly streamlines the data collection process using the robot manipulator. With more variations added, the total number of images is increased from 55,350 to 221,400. The increased diversity in the dataset allows the network to better generalize across different orientations, improving its performance when making predictions. Figure 5 shows a snippet of the dataset.
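The image counts can be checked with a short calculation, and the quadrant augmentation can be sketched with lossless 90° rotations. The decomposition into 45 objects (24 + 16 + 5), 30 orientations, and 41 depth levels is inferred from the counts stated elsewhere in the text and reproduces the reported totals; the rotation helpers are hypothetical and assume square images.

```python
def rotate_90_clockwise(image):
    """Rotate a square image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def four_quadrant_views(image):
    """Return the original image plus its 90, 180, and 270 degree rotations."""
    views = [image]
    for _ in range(3):
        views.append(rotate_90_clockwise(views[-1]))
    return views


# Dataset size check (decomposition inferred from the text).
num_objects = 24 + 16 + 5                 # training + complex + everyday
num_orientations = len(range(0, 88, 3))   # 0, 3, ..., 87 degrees
num_depth_levels = 41                     # 0 to 40 mm in 1 mm steps
captured_images = num_objects * num_orientations * num_depth_levels
augmented_images = captured_images * 4    # four quadrants

sample = [[1, 2],
          [3, 4]]
views = four_quadrant_views(sample)
print(captured_images, augmented_images, views[1])
```

Rotations by multiples of 90° only permute pixels, so no interpolation artifacts are introduced into the depth values.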

FIG. 5.

Examples of depth images collected. (a) Images of the hardest objects with varied shapes at different depths, where D signifies depth measured in millimeters (mm). (b) Images of a cube rotated at different angles denoted by R measured in degrees. (c) Different hardness levels, denoted by H; depth is kept constant at 40 mm.

Although subtle differences are apparent in the images of objects with the same shape but different hardness when the depth is kept constant (Fig. 5c), relying on a single depth level for hardness estimation is inadequate. A single image cannot capture the spatial–temporal features that arise as the membrane deformation evolves during the interaction between the gripper and the object, and these features are related to the hardness value.

Custom network design

Although 2D CNNs have proven to be highly effective in various image recognition tasks, they may not be the most suitable choice for hardness sensing. A key limitation is their inability to naturally handle temporal dependencies in a sequence of images, potentially leading to suboptimal performance. The solution is to stack a series of depth images into a 3D input volume and perform several 3D convolution operations to extract the spatial–temporal features. Using a 3D input volume ( $I$ ) and a 3D filter ( $K$ ), each 3D convolution operation results in a 3D feature map ( $O$ ): $O_{\mathrm{Conv3D}} (i, j, k) = \sum_{m = 0}^{M - 1} \sum_{n = 0}^{N - 1} \sum_{z = 0}^{Z - 1} I (i + m, j + n, k + z) \cdot K (m, n, z) + b$ (1)

where $O_{\mathrm{Conv3D}} (i, j, k)$ is the output at any position $(i, j, k)$ in the feature map, $I (i + m, j + n, k + z)$ represents the input voxel value at the location $(i + m, j + n, k + z)$ , $K (m, n, z)$ is the filter coefficient at position $(m, n, z)$ , and $b$ is the bias term, added once per output position. The summation is performed over the spatial dimensions $(M, N)$ and the temporal dimension $(Z)$ to account for the dynamic depth variations induced by the robotic motion.
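For illustration, the operation in Equation (1) can be written as a direct (unoptimized) Python loop. This sketch assumes a single filter, stride 1, no padding, and a bias added once per output position, which is the usual convention; it is not the implementation used for training, and the function name is hypothetical.

```python
def conv3d_valid(volume, kernel, bias=0.0):
    """Direct 3D convolution in the cross-correlation form of Equation (1).

    volume and kernel are nested lists indexed [i][j][k], where the first
    two axes are spatial and the last axis is temporal (depth level).
    Stride 1 and no padding are assumed.
    """
    size_i, size_j, size_k = len(volume), len(volume[0]), len(volume[0][0])
    M, N, Z = len(kernel), len(kernel[0]), len(kernel[0][0])
    output = []
    for i in range(size_i - M + 1):
        plane = []
        for j in range(size_j - N + 1):
            row = []
            for k in range(size_k - Z + 1):
                acc = 0.0
                for m in range(M):
                    for n in range(N):
                        for z in range(Z):
                            acc += volume[i + m][j + n][k + z] * kernel[m][n][z]
                row.append(acc + bias)  # bias added once per output voxel
            plane.append(row)
        output.append(plane)
    return output


# Toy input: a 3 x 3 x 2 volume of ones and a 2 x 2 x 2 kernel of ones.
ones_volume = [[[1.0] * 2 for _ in range(3)] for _ in range(3)]
ones_kernel = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
feature_map = conv3d_valid(ones_volume, ones_kernel, bias=0.5)
print(feature_map)
```

In the toy case each output voxel sums eight products of ones and adds the bias, so the kernel spans both the spatial neighbourhood and two consecutive depth levels at once.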

For optimization, the adjustment for stride and padding is incorporated. The stride determines the step size between successive positions at which the convolution is applied, and the padding adds extra layers of zeros around the input volume to control the output dimensions.
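The effect of stride and padding on the output dimensions follows the standard relation $\lfloor (n + 2p - k)/s \rfloor + 1$ per axis. The sketch below applies it to a 3D volume; the input shape, strides, and padding in the example are hypothetical values chosen for illustration, not the settings of any network variant.

```python
def conv_output_size(n, k, s=1, p=0):
    """Output length along one axis: floor((n + 2p - k) / s) + 1."""
    if n + 2 * p < k:
        raise ValueError("kernel larger than padded input")
    return (n + 2 * p - k) // s + 1


def conv3d_output_shape(shape, kernel, stride, padding):
    """Apply the per-axis relation to (temporal, height, width) tuples."""
    return tuple(conv_output_size(n, k, s, p)
                 for n, k, s, p in zip(shape, kernel, stride, padding))


# Hypothetical example: 41 depth images of 150 x 150 pixels, a (2, 3, 3)
# kernel, spatial stride 4, and no padding.
strided_shape = conv3d_output_shape((41, 150, 150), (2, 3, 3), (1, 4, 4), (0, 0, 0))

# With stride 1 and one layer of zero padding, a 3 x 3 spatial kernel
# preserves the spatial size.
padded_shape = conv3d_output_shape((41, 150, 150), (2, 3, 3), (1, 1, 1), (0, 1, 1))
print(strided_shape, padded_shape)
```

A larger spatial stride shrinks the feature maps early, which reduces the parameters in the fully connected stage and the training time, whereas padding counteracts the shrinkage where detail still needs to be preserved.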

To reduce the spatial dimensions which are notably larger than the temporal dimension, we selectively apply the max-pooling operation to the spatial dimensions, preserving the temporal features while downsampling spatial aspects.
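Spatial-only pooling corresponds to a pooling window of size 1 along the temporal axis. The following sketch illustrates the idea on nested lists indexed as [frame][row][column]; the window size of 2 and the function name are assumptions for illustration.

```python
def max_pool_spatial(volume, size=2):
    """Max-pool each frame over non-overlapping size x size windows.

    The temporal axis (the list of frames) is left untouched, which is
    equivalent to a 3D pooling window of (1, size, size).
    """
    pooled = []
    for frame in volume:
        rows, cols = len(frame), len(frame[0])
        pooled.append([
            [max(frame[r + dr][c + dc]
                 for dr in range(size) for dc in range(size))
             for c in range(0, cols - size + 1, size)]
            for r in range(0, rows - size + 1, size)
        ])
    return pooled


# Three identical 4 x 4 frames stand in for three depth levels.
frame_4x4 = [[1, 2, 3, 4],
             [5, 6, 7, 8],
             [9, 10, 11, 12],
             [13, 14, 15, 16]]
pooled_volume = max_pool_spatial([frame_4x4, frame_4x4, frame_4x4])
print(len(pooled_volume), pooled_volume[0])
```

The number of frames is unchanged after pooling, so the sequence information across depth levels remains available to later layers.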

Metrics

We employ the mean absolute percentage error (MAPE) as a metric to evaluate the accuracy of hardness sensing, and the standard deviation (SD) to measure the variability or dispersion of the MAPE values. MAPE is defined as: $\mathrm{MAPE} = \frac{1}{n} \sum_{j = 1}^{n} \left| \frac{A_{j} - P_{j}}{A_{j}} \right| \times 100$ (2)

where $n$ is the number of test instances, $A_{j}$ is the actual (normalized) hardness value for the $j$ -th test instance, and $P_{j}$ is the corresponding predicted (normalized) hardness value.

Given that we employ a 10-fold cross-validation, the overall MAPE, denoted as ${MAPE}_{overall}$ , represents the mean value computed from the cross-validation process.
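The metric computation can be sketched as follows. The hardness values below are hypothetical, only two folds are shown for brevity, and computing the SD over per-instance percentage errors is one possible reading of the text rather than a confirmed detail.

```python
from statistics import mean, stdev


def absolute_percentage_errors(actual, predicted):
    """Per-instance |A_j - P_j| / |A_j| * 100, as inside Equation (2)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have equal length")
    if any(a == 0 for a in actual):
        raise ValueError("actual values must be nonzero for MAPE")
    return [abs((a - p) / a) * 100 for a, p in zip(actual, predicted)]


def mape(actual, predicted):
    """Mean absolute percentage error for one test fold."""
    return mean(absolute_percentage_errors(actual, predicted))


def error_spread(actual, predicted):
    """Sample SD of the per-instance errors (assumed interpretation)."""
    return stdev(absolute_percentage_errors(actual, predicted))


# Hypothetical normalized hardness values for two folds.
folds = [
    ([0.50, 0.80], [0.45, 0.88]),
    ([0.40, 1.00], [0.42, 0.95]),
]
fold_mapes = [mape(actual, predicted) for actual, predicted in folds]
mape_overall = mean(fold_mapes)  # mean over the cross-validation folds
print(fold_mapes, mape_overall)
```

Because Equation (2) divides by the actual value, the normalization must keep every $A_{j}$ away from zero; the sketch raises an error in that case rather than returning an undefined result.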

Investigation of GripDepthSense3DNet

We commence network development with a foundational architecture (Group A in Fig. 6a). Reducing the temporal dimension before the interpretation stage aims to distill the most essential patterns, forming a condensed representation. During the interpretation stage, the network comprehends complex features, leading to an output reflecting the estimated hardness level based on learned features.

FIG. 6.

(a) Generic structures of the 3D CNN architectures used in developmental stage. N denotes the number of homogeneous feature extraction blocks. The deliberate choice to uphold a constant count of 256 neurons in the fully connected layer provides a controlled environment to scrutinize and navigate toward an optimal feature extraction stage. Aligning with the power-of-2 convention, the selection of 256 neurons is numerically convenient and sufficient to contain the extracted features. (b) High-level architecture of GripDepthSense3DNet-7, the best variant of GripDepthSense3DNet which yields the lowest MAPE and SD values. The architecture consists of 3D convolution (Conv3D), batch normalization (BatchNorm), rectified linear unit (ReLU) activation function, 3D max-pooling (MaxPool3D), flatten, and fully connected (FC) layers. During the dynamic tuning process, only the two layers marked with an asterisk (*) will be updated. The final fully connected (FC) layer is an output layer with one neuron.

In each feature extraction block, the 3D convolution (Conv3D) layer is adept at capturing complex patterns within the depth image series. The function of max-pooling in our network is to reduce only the spatial dimension of the feature maps, retaining the most salient information while discarding less relevant details. The parameters used are detailed in Table 2.

Table 2.

GripDepthSense3DNet (GDS3DNet) Variants

Conv3D and MaxPool3D are layers of the networks. The developmental networks are categorized into Groups A, B, and C, depending on their designs, with each network assigned a unique identifier.

This table summarizes the parameters varied across the studied GripDepthSense3DNet architectures: number of filters (f), kernel size (k), stride (s), padding (p), and number of homogeneous convolutional blocks (N). For instance, a kernel size of (2,3,3) in the Conv3D layer indicates a kernel size of 2 in the temporal dimension and (3,3) in the spatial dimensions. The naming convention adopted is GripDepthSense3DNet-x[sy][kz], where x denotes the total number of convolutional layers in the network, y represents the stride, and z signifies the kernel size. Note that the last two variables in the identifier (y and z) are included only if they distinguish one variant from another within the group; the values of y and z do not denote the strides and kernel sizes of all convolutional layers but refer specifically to the differentiating factors. The numbers of filters are powers of 2, that is, 32, 64, and 128. This practice aligns with a common approach in deep learning to leverage computational advantages while maintaining model performance.
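The naming convention can be made concrete with a small parser. This is an illustrative helper only; the function name and the returned dictionary keys are hypothetical.

```python
import re

# x = number of convolutional layers, optional s<y> = stride,
# optional k<z> = kernel size, following GripDepthSense3DNet-x[sy][kz].
_VARIANT_PATTERN = re.compile(
    r"^(?P<layers>\d+)(?:s(?P<stride>\d+))?(?:k(?P<kernel>\d+))?$"
)


def parse_variant(identifier):
    """Split a variant identifier into its components.

    Accepts the full or abbreviated prefix, e.g. 'GDS3DNet-4s4k3'.
    Fields absent from the identifier are returned as None.
    """
    suffix = identifier.rsplit("-", 1)[-1]
    match = _VARIANT_PATTERN.match(suffix)
    if match is None:
        raise ValueError(f"unrecognized variant identifier: {identifier!r}")
    return {key: (int(value) if value is not None else None)
            for key, value in match.groupdict().items()}


print(parse_variant("GDS3DNet-4s4k3"))
print(parse_variant("GripDepthSense3DNet-7"))
```

As noted above, the stride and kernel fields identify only the differentiating factor of a variant, so a value of None means that the field does not distinguish that variant, not that the network lacks strides or kernels.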

In Figure 6a, Group B improves on Group A by incorporating additional convolutional layers to enhance the feature extraction process. Multiple configurations are explored, including the introduction of homogeneous intermediate feature extraction blocks before the final block. Emphasis is placed on adjusting the stride to shorten training time while maintaining performance. Beyond the GDS3DNet-4s4 variant, the kernel size of the second convolutional layer is changed to (3, 3, 3) to improve the model’s ability to capture spatial–temporal dynamics in the depth image series; the first dimension of the kernel represents the temporal axis, that is, it captures changes across different frames. Although this modification increased training time (from 1 h 24 min to 2 h 52 min; Table 3), its effectiveness was validated empirically through an ablation experiment, which showed significant reductions in MAPE values in subsequent iterations. Without this adjustment, later iterations would have been constrained by the limited 3D feature extraction capability of the original kernel configuration. Additionally, padding in the first convolutional layer becomes essential to prevent a drastic reduction in the output feature map size during the initial stages, where spatial and temporal nuances might not yet have been adequately extracted.

In Group C (see Fig. 6a), the first two feature extraction blocks serve to capture foundational features and reduce spatial dimensions, whereas the subsequent convolutional layers progressively extract more complex patterns and high-level representations. The homogeneous feature extraction blocks are crafted to exclusively comprise Conv3D and rectified linear unit (ReLU) layers, considering that feature maps are inherently compact during this segment of the network. Introducing multiple max-pooling operations at this segment could render the convolutional layers inoperative. This design consideration aligns with the approach in the popular AlexNet⁴³ architecture, which also avoids the use of max-pooling layers in its intermediate sections.

In all variants, we progressively increase the number of filters, from the first feature extraction block to the last. As the number of filters grows, the network gains the ability to discern a broader range of features extracted. It also allows the model to learn hierarchical representations, capturing both low and high-level features that contribute to the overall understanding of the data.

GDS3DNet-7 (Fig. 6b) is selected as the optimal network, boasting the lowest overall MAPE and satisfactory training time (Table 3). The addition of convolutional layers in GDS3DNet-8 and GDS3DNet-9 did not yield improvements in either MAPE or SD, despite the heightened complexity and increased computational cost, suggesting that the introduction of additional layers would likely lead to overfitting.

Table 3.

Performance Metrics for GripDepthSense3DNet Variants

Group	Version	Number of parameters	Training time	MAPE_overall (%)	SD_overall	Key design features
A	2	14,784,321	8 h 45 min	4.37	3.84	Basic model as the foundation
B	3	6,609,345	13 h 12 min	2.74	2.63	Enriched convolutions, added padding in the intermediate layer to retain dimensions
B	4s1	1,440,513	14 h 5 min	1.93	1.95	Comprehensive feature extraction in the intermediate section
B	4s2	391,937	3 h 50 min	1.65	1.62	Significantly reduced the training time
B	4s3	293,633	1 h 51 min	1.62	1.52	Lighter network by downsampling feature maps
B	4s4	293,633	1 h 24 min	1.66	1.51	More compact feature representation, ideal for subsequent iterations
B	4s4k3	455,425	2 h 52 min	1.56	1.45	Larger receptive field to capture broader spatial–temporal contexts
C	5	1,007,009	2 h 29 min	1.43	1.52	Improved structure for scalable iterations
C	6	1,117,665	2 h 46 min	0.78	1.12	Integrated additional convolutional layer for deeper feature hierarchies
C	7	1,228,321	3 h 27 min	0.46	0.79	Expanded depth for incremental gains
C	8	1,338,977	3 h 39 min	0.57	1.26	Multiscale feature extraction
C	9	1,449,633	3 h 56 min	0.63	1.10	Exploration on expanding the network depth

Key considerations during model development encompass the number of parameters, training time, overall MAPE (mean value computed from the 10-fold cross-validation), and standard deviation.

MAPE, mean absolute percentage error; SD, standard deviation.

A 10-fold cross-validation process ensures robust evaluation. However, the training time is recorded for a single iteration, mirroring real-world scenarios where training is typically performed once. Training is performed on an NVIDIA GeForce GTX 960M graphics processing unit with 8.0 GB memory. On average, the testing time for each hardness sensing instance using GDS3DNet-7 is 102 milliseconds, provided that the required images and dependency modules are preloaded. The sensing process can be integrated into pick-and-place operations. For reference, the gripping time alone for a commercially available food-grade soft gripper⁴⁴ is 0.32 s, whereas a soft gripper discussed in the literature⁴⁵ shows a range of 0.34–0.70 s.
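For readers who wish to reproduce the metric, the overall MAPE can be sketched as the mean of the per-fold MAPE values. The snippet below uses the standard definition of MAPE; the function names and the hardness readings are hypothetical, and the exact aggregation used for the reported values is assumed rather than taken from this section.

```python
from statistics import mean


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (standard definition)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return 100.0 * mean(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))


def overall_mape(folds):
    """Average the per-fold MAPE over the cross-validation folds (assumed aggregation)."""
    return mean(mape(t, p) for t, p in folds)


# Hypothetical (ground truth, prediction) hardness readings for two folds.
folds = [
    ([20.0, 40.0], [21.0, 40.0]),  # errors of 5% and 0%, fold MAPE 2.5%
    ([50.0, 25.0], [49.0, 25.5]),  # errors of 2% and 2%, fold MAPE 2.0%
]
print(round(overall_mape(folds), 2))  # 2.25
```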

Experiments

For consistency across all training sessions, a standardized protocol is employed, involving 400 epochs, a batch size of 10, and the Adam optimizer with a learning rate of 0.001. A small learning rate is used so that the model converges gradually toward the minimum loss, and 400 epochs ensure that the network is sufficiently trained. The dataset is shuffled and then partitioned into mutually exclusive sets for training, validation, and testing. Generally, 10% of the data is reserved for the test set, whereas the remaining 90% is further split into 80% for training and 20% for validation (i.e., 72% and 18% of the full dataset, respectively). The model parameters are saved only when the validation loss improves. Dynamic tuning experiments employ different splitting percentages, which are detailed in the corresponding subsections.
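The splitting protocol and the save-on-improvement rule can be sketched as follows. This is an illustrative sketch under stated assumptions, not the training code used in this work: the function names, the fixed seed, the dataset size of 1000, and the validation losses are all hypothetical.

```python
import random


def split_dataset(samples, test_frac=0.10, val_frac=0.20, seed=0):
    """Shuffle, hold out a test set, then split the rest into train/validation."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_frac)
    test, rest = items[:n_test], items[n_test:]
    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


def epochs_that_save(val_losses):
    """Return the epochs (1-based) at which the parameters would be saved."""
    best, saved = float("inf"), []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # save only on a strict improvement
            best = loss
            saved.append(epoch)
    return saved


# Hypothetical dataset of 1000 sample indices.
train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 720 180 100

# Hypothetical validation losses over five epochs.
print(epochs_that_save([0.9, 0.7, 0.8, 0.6, 0.6]))  # [1, 2, 4]
```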

Performance of GripDepthSense3DNet on training objects

In the initial phase, we select the optimal GDS3DNet-7, and proceed to train it using the dataset comprising the 24 training objects shown in Figure 3a. At this juncture, the depth interval remains fixed at 5 mm. The input volume format is shown in Figure 7a. The number of depth levels is configured to four to ensure a fair comparison with state-of-the-art networks, as four subimages of 128 × 128 pixels can precisely tile a square 2D image with no unutilized pixels (Fig. 7c). The initial experiment explores four different depth ranges: 10–25 mm, 15–30 mm, 20–35 mm, and 25–40 mm.

FIG. 7.

Input volume/image formed using the depth image array. (a) Four depth images of interval 5 mm (25 mm, 30 mm, 35 mm, 40 mm) are stacked in order to form an input volume of size 128 × 128 × 4 voxels. (b) Depending on the depth interval and the number of depth levels, denoted by Z, the depth images are stacked to form an input volume of size 128 × 128 × Z voxels. (c) Four depth images of interval 5 mm are concatenated to tile a 2D image of size 256 × 256 pixels.

The next phase of optimization involves adjusting the depth interval (5, 10, and 15 mm) and the number of depth levels (ranging from three to six). The input volume format is illustrated in Figure 7b.

Benchmarking with state-of-the-art networks

The performance of GDS3DNet-7, with the optimized depth range, is benchmarked against AlexNet,⁴³ DenseNet-121,⁴⁶ Inception-v3,⁴⁷ ResNet-18, ResNet-34, ResNet-50, ResNeXt-50,⁴⁸ and ViT-T/16.⁴⁹ The inclusion of different ResNet⁵⁰ variants aims to assess the influence of network depth on hardness sensing accuracy. AlexNet is considered for its distinctive architecture, providing insights into its performance relative to more contemporary network architectures.

Because these networks operate on 2D inputs, the depth images are concatenated to form a unified input image (Fig. 7c). This maintains consistency with GDS3DNet-7, as the number of input neurons remains the same at 65,536, derived from 128 × 128 × 4 voxels. The number of parameters for GDS3DNet-7 is 1,228,321, whereas the parameter counts for the compared networks are provided in Table 6.
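The equivalence between the 3D input volume and the tiled 2D image can be checked with a short sketch. The row-major placement of the four subimages and the constant-valued placeholder images are assumptions made for illustration; the actual tiling order is the one shown in Figure 7c.

```python
def tile_2x2(images):
    """Tile four H x W images into one 2H x 2W image (assumed row-major order)."""
    if len(images) != 4:
        raise ValueError("expected exactly four images")
    a, b, c, d = images
    top = [row_a + row_b for row_a, row_b in zip(a, b)]
    bottom = [row_c + row_d for row_c, row_d in zip(c, d)]
    return top + bottom


def constant_image(value, size=128):
    """Hypothetical stand-in for a depth image: a size x size constant grid."""
    return [[value] * size for _ in range(size)]


# Four placeholder depth images, labelled by capture depth in mm.
depth_images = [constant_image(d) for d in (25, 30, 35, 40)]
tiled = tile_2x2(depth_images)

voxels_3d = len(depth_images) * 128 * 128  # 128 x 128 x 4 volume
pixels_2d = len(tiled) * len(tiled[0])     # 256 x 256 image
print(voxels_3d, pixels_2d)  # 65536 65536
```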

Dynamic tuning for untrained shapes/hardness

For experiments related to dynamic tuning for novel shapes or hardness, each iteration involves excluding images of one specific shape or hardness from training. Upon completion of the initial training, the layers within the feature extraction stage are frozen. Once the feature extraction stage is sufficiently trained to extract intricate features related to the hardness attribute, novel elements can be integrated merely by updating the weights and biases of the fully connected layers using a small set of images. For this purpose, 90% of the excluded images are designated as the test set, with 5% assigned to the validation set and another 5% to the training set for dynamic tuning.
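One iteration of this leave-one-out protocol can be sketched as below. The dataset layout, the function name, and the count of 100 images per shape are hypothetical; only the 5%/5%/90% partition of the excluded images follows the description above. Freezing the feature extraction layers is framework-specific and is not shown.

```python
import random


def dynamic_tuning_split(samples, held_out, key, seed=0):
    """Split samples for one leave-one-group-out dynamic tuning iteration."""
    initial = [s for s in samples if s[key] != held_out]   # initial training pool
    excluded = [s for s in samples if s[key] == held_out]  # unseen shape/hardness
    random.Random(seed).shuffle(excluded)
    n_tune = round(len(excluded) * 0.05)
    tune_train = excluded[:n_tune]             # 5% for dynamic tuning
    tune_val = excluded[n_tune:2 * n_tune]     # 5% for validation
    tune_test = excluded[2 * n_tune:]          # remaining 90% for testing
    return initial, tune_train, tune_val, tune_test


# Hypothetical dataset: 100 images for each of four shapes.
shapes = ["cube", "cylinder", "hexagon", "sphere"]
dataset = [{"shape": s, "image_id": i} for s in shapes for i in range(100)]

initial, tune_train, tune_val, tune_test = dynamic_tuning_split(
    dataset, held_out="sphere", key="shape"
)
print(len(initial), len(tune_train), len(tune_val), len(tune_test))  # 300 5 5 90
```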

Dynamic tuning for novel objects

Similarly, when introducing complex-shaped objects or everyday objects with untrained shapes and hardness into the network, dynamic tuning can be achieved by utilizing a small set of the new images. This method obviates the need for retraining the entire network, allowing for targeted tuning tailored to applications with specific objects.

Experiment Results and Analysis

The performance and optimization of GripDepthSense3DNet are showcased, along with a comparative analysis against state-of-the-art networks. Following this, dynamic tuning results are detailed.

Performance of GripDepthSense3DNet on training objects

In Table 4, no improvement is noted when comparing the depth ranges of 10–25 mm and 15–30 mm, suggesting that the object deformation within these ranges is not substantial enough for effective recognition of hardness variations.

Table 4.

Evaluation of Hardness Sensing Using GDS3DNet-7 Under Varying Depth Ranges

Depths (mm)	MAPE_overall (%)	SD_overall
10, 15, 20, 25	1.18	3.62
15, 20, 25, 30	1.44	3.06
20, 25, 30, 35	0.55	1.34
25, 30, 35, 40	0.46	0.79

For the sake of consistent comparisons, the depth interval is fixed at 5 mm, and the number of depth levels is maintained at four.

A notable improvement is witnessed when transitioning to 20–35 mm, resulting in a 62% decrease in MAPE relative to the 15–30 mm range (from 1.44% to 0.55%). This trend of improvement continues when shifting the depth range to 25–40 mm, yielding the lowest MAPE of 0.46%. Likewise, the SD for this range is the lowest among all cases at 0.79, indicating enhanced stability and consistency. Greater depths are likely preferred because they produce more deformation than shallower depths, which enhances the network’s ability to discern and identify hardness variations.

Figure 8a shows the ground-truth hardness values of different objects. With a depth range of 25–40 mm, Figure 8b demonstrates the model’s output values closely mirroring the actual hardness values, indicating a strong correlation between the model’s predictions and the ground-truth data. Figure 8c and d present the breakdown of results based on shapes and hardness. The analysis reveals that the MAPE is highest for the sphere, followed by the hexagon, cube, and cylinder. The elevated MAPE for the sphere can be attributed to its curved nature, aligning with the curvature of the membrane when the fingers close, yielding significantly less deformation. Concerning hardness, a consistent and descending MAPE trend is noted from Hardness I to Hardness VI. This reiterates that increased deformation enhances hardness sensing capabilities, as softer objects exhibit more pronounced object deformation during grasping.

FIG. 8.

(a) Ground-truth hardness values. (b) Predictions made using GDS3DNet-7, when both shapes and hardness are included in training. Red line illustrates the ideal scenario. (c, d) Breakdown of GDS3DNet-7’s results according to MAPE and SD, where “M” denotes the mean values. The heatmap scale ranges from the local minimum to maximum values to emphasize the variation across the results.

From Table 5, although the lowest MAPE is achieved using three depths with a 15 mm interval, we opt for the configuration of four depths with a 5 mm interval for the ensuing experiments. One reason is that it yields the lowest SD value while maintaining a relatively low MAPE value, emphasizing a trade-off between optimizing accuracy and minimizing variance. Another rationale for maintaining four depths is to ensure consistency in input data size for benchmarking against state-of-the-art networks.

Table 5.

Configurations for Network Optimizations, Along with the Corresponding Results in Terms of Overall MAPE and Standard Deviation

Depth interval (mm)	Number of depths	Depths (mm)	MAPE_overall (%)	SD_overall
5	3	30, 35, 40	0.71	1.09
5	4	25, 30, 35, 40	0.46	0.79
5	5	20, 25, 30, 35, 40	0.60	1.21
5	6	15, 20, 25, 30, 35, 40	0.42	0.91
10	3	20, 30, 40	0.48	1.09
10	4	10, 20, 30, 40	0.46	1.22
15	3	10, 25, 40	0.41	1.03

Building on the knowledge gained earlier that greater depths are preferred, the depth levels are selected such that 40 mm is the final depth.
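The depth levels in Table 5 follow directly from the interval, the number of depths, and the fixed final depth of 40 mm. A minimal sketch of this selection rule is given below; the function name is hypothetical.

```python
def depth_levels(interval_mm, num_depths, final_depth_mm=40):
    """Evenly spaced depths ending at final_depth_mm, shallowest first."""
    first = final_depth_mm - interval_mm * (num_depths - 1)
    if first <= 0:
        raise ValueError("configuration would require a non-positive depth")
    return [first + interval_mm * i for i in range(num_depths)]


# The seven (interval, number of depths) configurations listed in Table 5.
configurations = [(5, 3), (5, 4), (5, 5), (5, 6), (10, 3), (10, 4), (15, 3)]
for interval, count in configurations:
    print(interval, count, depth_levels(interval, count))
```

The guard against non-positive depths also explains why larger intervals admit fewer depth levels: a 15 mm interval with four depths would require a starting depth of −5 mm.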

Although the aforementioned results were obtained using a 10-finger gripper, the robustness of GripDepthSense3DNet has been validated by executing hardness sensing with a 5-finger gripper, using the optimal depth configuration of 4 depths and 5 mm interval. The resulting MAPE is found to be 0.36%, indicating that the network could be trained with a different gripper configuration (see Supplementary Data S2).

Benchmarking with state-of-the-art networks

In Figure 9a, GDS3DNet-7 demonstrates accelerated convergence to the minima when compared to alternative architectures. Upon reaching a steady state, the error values consistently maintain a lower profile than those of other models. Referring to Figure 9b and Table 6, GDS3DNet-7 has the lowest MAPE and SD values, as well as the shortest training time. A higher number of parameters does not guarantee improved performance, such as in the case of AlexNet, as this abundance of parameters may lead to overfitting and inefficiencies in learning the hardness-related features. Compared to ResNet-50, which achieves a MAPE of 0.60% as the best-performing state-of-the-art network in the comparison, GDS3DNet-7 delivers an impressive 94.8% reduction in parameters while reducing training time by approximately 92.9% on equivalent hardware. Figure 10 presents a detailed breakdown of results based on object shapes and hardness levels, for several comparable networks.

FIG. 9.

(a) Training curves for GDS3DNet-7 and several comparable state-of-the-art networks (only one ResNet variant is shown) with y-axis limited to a maximum of 10 for clarity. The displayed curves are best-fit lines illustrating reductions in training losses. Note that the best model parameters for each network are selected based on the epoch with the lowest validation loss. (b) MAPE and SD values of the comparable networks, including the developmental GDS3DNet variants.

FIG. 10.

Breakdown of several comparable state-of-the-art networks’ results according to MAPE and SD, where “M” denotes the mean values. For conciseness, only the best-performing ResNet variant is shown, and the remaining networks for this figure are chosen based on their performance. Results for GDS3DNet-7 have been shown earlier. (a, b) AlexNet. (c, d) DenseNet-121. (e, f) Inception-v3. (g, h) ResNet-50.

Table 6.

Performances of GripDepthSense3DNet-7 and State-of-the-Art Networks

Network	Number of parameters	Training time	MAPE_overall (%)	SD_overall
GDS3DNet-7	1,228,321	3 h 27 min	0.46	0.79
AlexNet	75,989,009	12 h 50 min	1.22	1.67
DenseNet-121	7,096,769	Unavailable^a (Memory Constraint)	1.18	1.27
Inception-v3	21,875,937	42 h 20 min	1.25	2.69
ResNet-18	11,180,353	18 h 05 min	1.04	1.02
ResNet-34	21,295,937	30 h 05 min	0.89	2.43
ResNet-50	23,583,489	48 h 36 min	0.60	1.28
ResNeXt-50	23,043,905	Unavailable^a (Memory Constraint)	2.09	2.30
ViT-T/16	5,486,977	41 h 59 min	3.37	3.66

The ResNet variant exhibiting the closest accuracy to GDS3DNet-7 is ResNet-50, albeit still having a higher overall MAPE.

For DenseNet-121 and ResNeXt-50, the training times were not recorded as the training could not be completed on the same local computer due to memory limitations. Consequently, the MAPE and SD values were obtained using a high-performance computing (HPC) platform with 64 GB of memory. As these two networks perform more poorly overall, the training time is not considered an essential marker for them.
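The reductions quoted relative to ResNet-50 can be recomputed from the entries in Table 6. The short sketch below performs this arithmetic; the helper names are hypothetical, and the input values are copied from the table.

```python
def percent_reduction(baseline, value):
    """Relative reduction of value with respect to baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


def minutes(hours, mins):
    """Convert a training time given as hours and minutes into minutes."""
    return 60 * hours + mins


# Values taken from Table 6.
params_gds, params_resnet50 = 1_228_321, 23_583_489
time_gds, time_resnet50 = minutes(3, 27), minutes(48, 36)

print(round(percent_reduction(params_resnet50, params_gds), 1))  # 94.8
print(round(percent_reduction(time_resnet50, time_gds), 1))      # 92.9
```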

Dynamic tuning for untrained shapes/hardness

When incorporating untrained hardness levels, the resulting mean MAPE is identified to be 1.17% (Table 7), with no substantial difference in MAPE values across various hardness levels. For untrained shapes, the observation in Table 7 aligns with the previous findings where the spherical shape yields the highest MAPE and SD. In contrast, for other shapes, the MAPE values remain acceptable, ranging from 0.54% to 1.85%.

Table 7.

Breakdown of GDS3DNet-7 Hardness Sensing Results According to Shape/Hardness

Hardness	I	II	III	IV	V	VI	Mean
MAPE (%)	0.89	0.44	1.88	1.08	2.06	0.69	1.17
SD	0.79	0.50	2.11	1.14	2.36	0.61	1.25

Shape	Cube	Cylinder	Hex	Sphere	Mean
MAPE (%)	1.85	0.54	1.01	8.85	3.06
SD	2.33	0.62	1.33	7.99	3.07

For each of the labeled shape/hardness columns, the shape/hardness studied has been excluded during the respective initial training.

Dynamic tuning for novel objects

All 16 complex objects were collectively assimilated into the network. By using a training/validation percentage of 10%, the mean MAPE is 6.86%, as shown in Table 8.

Table 8.

Breakdown of Dynamic Tuning Results for Novel Objects

Complex-shaped objects
Object	A	B	C	D	Mean
Bear
Hardness (H00)	28	31	46	61	—
MAPE (%)	1.96	2.19	6.21	9.23	4.90
SD	2.12	2.49	5.45	6.91	4.24
Rose
Hardness (H00)	25	31	46	62	—
MAPE (%)	8.12	11.11	5.85	4.21	7.32
SD	7.26	6.16	5.34	4.10	5.72
Shell
Hardness (H00)	29	43	47	57	—
MAPE (%)	8.09	10.70	5.72	6.45	7.74
SD	9.92	6.59	4.88	5.34	6.68
Snail
Hardness (H00)	29	39	47	57	—
MAPE (%)	11.77	4.30	7.64	6.24	7.49
SD	10.48	4.18	5.38	5.44	6.37
Mean
MAPE (%)	7.49	7.07	6.36	6.53	6.86
SD	7.44	4.86	5.26	5.45	5.75

Everyday objects
Object	Sponge-triangle	Marshmallow	Sapodilla	Kiwi	Stress Ball	Mean
Hardness (H00)	20	30	37	51	62	—
MAPE (%)	2.17	1.12	2.96	1.63	1.25	1.82
SD	4.58	2.06	6.17	2.80	1.37	3.39

For each complex shape, there are four identical objects with different hardness.

The Bear shape demonstrates a clear trend of increasing MAPE as the hardness increases, with values ranging from 1.96% for Bear-A (softest) to 9.23% for Bear-D (hardest), and a mean MAPE of 4.90%. This trend aligns with earlier findings, where the network tends to perform well for softer objects. However, the error increases for Bear-C and Bear-D, likely because harder objects exhibit less deformation.

In contrast, the Rose, Shell, and Snail shapes do not follow the same increasing MAPE trend with increasing hardness. Although the network predicts their hardness with reasonable accuracy, these shapes show significant variation in their MAPE values across different hardness levels, suggesting that the complexity of their geometry introduces challenges for the network. The SD values point the same way: the Bear shape has the lowest mean SD (4.24%), whereas the Shell and Snail shapes show higher SDs (6.68% and 6.37%, respectively).

A major contributing factor to this variability could be the presence of concave portions in the geometries of these shapes, which differ significantly from the objects seen in the training set. These features introduce nonlinearities and complex interactions between the material and its environment, which the network may not recognize. Furthermore, the dynamic tuning process updates only the fully connected layers, leaving the weights in the feature extraction stage frozen. This limits the model’s capacity to adjust to the new shape complexities, resulting in higher errors for the complex shapes.

A larger dataset for dynamic tuning would contribute to better performance by exposing the network to a more diverse range of examples. This has been validated through an experiment using different percentages for dynamic tuning (see Supplementary Data S3). With every incremental increase in percentage, there is a notable decrease in the MAPE value, indicating that the network can better capture underlying patterns and relationships. Conversely, a smaller dataset may lead to overfitting and reduced generalization.

For the everyday objects, the model achieved better performance compared to the complex objects. Their simpler geometries, free from intricate or concave features, allowed the network to process them more effectively. Certain everyday objects, such as the kiwi and stress ball, shared strong similarities in shape with the training objects.

Conclusion

We proposed and optimized GripDepthSense3DNet, a novel network designed for hardness sensing during object grasping using a series of 3D depth images. To investigate and optimize GripDepthSense3DNet, we used a modified version of the widely used jamming-based soft gripper that, instead of a jamming material, integrates flexible fingers and a markerless deformable contact membrane. The optimal depth range of 25–40 mm in the four-depth scenario demonstrated a low MAPE of 0.46%. Benchmarking against state-of-the-art networks highlighted the efficiency of GripDepthSense3DNet, with superior accuracy, fewer trainable parameters, and shorter training times. Enhanced stability is evident from the SD of 0.79, which is the lowest among all networks tested.

Dynamic tuning experiments demonstrated the network’s adaptability, successfully integrating untrained shapes and hardness levels with commendable results. Notably, the network exhibited consistent performance across various untrained hardnesses, leading to a mean MAPE of 1.17%. Introducing complex objects demonstrated promising adaptability, with an overall MAPE of 6.86%. However, more complex shapes, such as the Shell and Snail, tended to introduce higher errors due to the embedded intricacies.

Although our results are promising, there are some limitations to our approach. Currently, our approach works with objects within the Shore-00 hardness scale. This restriction is due to the gripper’s inability to effectively produce and capture useful deformation outside this hardness range. Additionally, the gripper cannot securely hold flat objects, as the design requires objects with some thickness to enable stable grasping. Objects must also fit within the contact surface of the gripper. If an object exceeds these boundaries, it cannot be properly enclosed, resulting in insufficient deformation for effective depth image capture. Finally, the depth resolution of the captured images is limited by their fixed 8-bit pixel intensity range.

For future work, we aim to develop advanced gripper designs capable of producing useful deformation beyond the Shore-00 scale. Additionally, we plan to explore alternative grasping methodologies to overcome the current limitations on object types and enhance versatility. We will also explore the potential across broader use cases, such as fruit processing (e.g., grade-based sorting) and food processing (e.g., quality assurance). To enhance the proposed framework, an additional image processing algorithm could also be integrated in the future to alert the user if elasticity degradation of the gripper is detected.

Footnotes

Authors’ Contributions

T.R.L.: Developed the system, performed and analyzed the experiments, and prepared the article. B.L.J.S.: Performed and analyzed the experiments. C.P.T.: Supervised the project. S.G.N.: Conceived and supervised the project. M.A.J.: Conceived and supervised the project.

Author Disclosure Statement

No conflicts of interest to disclose.

Funding Information

This work was supported by the Fundamental Research Grant Scheme (FRGS) (Grant No. FRGS/1/2023/TK10/MUSM/02/1) provided by the Ministry of Higher Education Malaysia, and the Graduate Research Merit Scholarship from the School of Engineering, Monash University Malaysia.

Supplementary Materials

References

1. Ali, Zhanabayev, Khamzhin, et al. Biologically inspired gripper based on the fin ray effect. In: 2019 5th International Conference on Control, Automation and Robotics (ICCAR) IEEE; 2019; pp. 865–869.

2. Ltd. FCo. FlexShapeGripper. 2017. Available from: https://www.festo.com/group/en/cms/10217.htm

3. Paek, Cho, Kim. Microrobotic tentacles with spiral bending capability based on shape-engineered elastomeric microtubes. Sci Rep, 2015; 5(1):10768–10711.

4. Li, Stampfli, Xu, et al. A vacuum-driven origami “Magic-Ball” soft gripper. In: 2019 International Conference on Robotics and Automation (ICRA) IEEE; 2019; pp. 7401–7408.

5. Hao, Gong, Xie, et al. Universal soft pneumatic robotic gripper with variable effective length. In: 2016 35th Chinese Control Conference (CCC) IEEE; 2016; pp. 6109–6114.

6. Zhou, Chen, Wang. A soft-robotic gripper with enhanced object adaptation and grasping reliability. IEEE Robot Autom Lett, 2017; 2(4):2287–2293.

7. Suzumori, Iikura, Tanaka. Development of flexible microactuator and its applications to robotic mechanisms. In: Proceedings. 1991 IEEE International Conference on Robotics and Automation IEEE Computer Society; 1991; pp. 1622–1623.

8. Carpi, De Rossi, Kornbluh, et al. Dielectric Elastomers as Electromechanical Transducers: Fundamentals, Materials, Devices, Models and Applications of an Emerging Electroactive Polymer Technology. Elsevier; 2011.

9. Brown, Rodenberg, Amend, et al. Universal robotic gripper based on the jamming of granular material. Proc Natl Acad Sci USA, 2010; 107(44):18809–18814.

10. Amend, Cheng, Fakhouri, et al. Soft robotics commercialization: Jamming grippers from research to product. Soft Robot, 2016; 3(4):213–222.

11. Lee, Lee, Messersmith. A reversible wet/dry adhesive inspired by mussels and geckos. Nature, 2007; 448(7151):338–341.

12. Delgado, Jara, Torres. Adaptive tactile control for in-hand manipulation tasks of deformable objects. Int J Adv Manuf Technol, 2017; 91(9–12):4127–4140.

13. Drimus, Kootstra, Bilberg, et al. Classification of rigid and deformable objects using a novel tactile sensor. In: 2011 15th International Conference on Advanced Robotics (ICAR) IEEE; 2011; pp. 427–434.

14. Huang, Rosendo. Variable stiffness object recognition with a cnn-bayes classifier on a soft gripper. Soft Robot, 2022; 9(6):1220–1231.

15. Huang, Liu, Bajcsy. A depth camera-based soft fingertip device for contact region estimation and perception-action coupling. In: 2019 International Conference on Robotics and Automation (ICRA) IEEE; 2019; pp. 8443–8449.

16. Alspach, Hashimoto, Kuppuswamy, et al. Soft-Bubble: A highly compliant dense geometry tactile sensor for robot manipulation. In: 2019 2nd IEEE International Conference on Soft Robotics (RoboSoft) IEEE; 2019; pp. 597–604.

17. Kuppuswamy, Castro, Phillips-Grafflin, et al. Fast model-based contact patch and pose estimation for highly deformable dense-geometry tactile sensors. IEEE Robot Autom Lett, 2020; 5(2):1811–1818; doi: 10.1109/LRA.2019.2961050

18. Yuan, Srinivasan, Adelson. Estimating object hardness with a gelsight touch sensor. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) IEEE; 2016; pp. 208–215.

19. Yuan, Zhu, Owens, et al. Shape-independent hardness estimation using deep learning and a gelsight tactile sensor. In: 2017 IEEE International Conference on Robotics and Automation (ICRA) IEEE; 2017; pp. 951–958.

20. Sakuma, Von Drigalski, Ding, et al. A universal gripper using optical sensing to acquire tactile information and membrane deformation. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) IEEE; 2018; pp. 1–9.

21. Yamaguchi, Atkeson. Implementing tactile behaviors using fingervision. In: 2017 IEEE-RAS 17th International Conference on Humanoid Robotics (Humanoids) IEEE; 2017; pp. 241–248.

22. Abad, Ranasinghe. Low-cost GelSight with UV markings: Feature extraction of objects using AlexNet and optical flow without 3D image reconstruction. In: 2020 IEEE International Conference on Robotics and Automation (ICRA) IEEE; 2020; pp. 3680–3685.

23. Hristu, Ferrier, Brockett. The performance of a deformable-membrane tactile sensor: Basic results on geometrically-defined tasks. In: Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065) IEEE; 2000; pp. 508–513.

24. Ferrier, Brockett. Reconstructing the shape of a deformable membrane from image data. International Journal of Robotics Research, 2000; 19(9):795–816; doi: 10.1177/02783640022067184

25. Hughes, Iida. Tactile sensing applied to the universal gripper using conductive thermoplastic elastomer. Soft Robot, 2018; 5(5):512–526.

26. Bednarek, Kicki, Bednarek, et al. Gaining a sense of touch object stiffness estimation using a soft gripper and neural networks. Electronics (Basel), 2021; 10(1):96.

27. Wang, Hirai. A 3D printed soft gripper integrated with curvature sensor for studying soft grasping. In: 2016 IEEE/SICE International Symposium on System Integration (SII) IEEE; 2016; pp. 629–633.

28. Arriola-Rios, Wyatt. A multimodal model of object deformation under robotic pushing. IEEE Trans Cogn Dev Syst, 2017; 9(2):153–169.

29. Frank, Schmedding, Stachniss, et al. Learning the elasticity parameters of deformable objects with a manipulation robot. In: 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems IEEE; 2010; pp. 1877–1883.

30. Fugl, Jordt, Petersen, et al. Simultaneous estimation of material properties and pose for deformable objects from depth and color images. In: Joint DAGM (German Association for Pattern Recognition) and OAGM Symposium Springer; 2012; pp. 165–174.

31. Guler, Pauwels, Pieropan, et al. Estimating the deformability of elastic materials using optical flow and position-based dynamics. In: 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids) IEEE; 2015; pp. 965–971.

32. Oliveira, Vaidya, Padir, et al. A soft durometer for tactile sensing. In: 2021 IEEE 4th International Conference on Soft Robotics (RoboSoft) IEEE; 2021; pp. 427–434.

33. Nassiri, Tahavoor, Jafari. Fuzzy logic classification of mature tomatoes based on physical properties fusion. Information Processing in Agriculture, 2022; 9(4):547–555.

34. Scimeca, Maiolino, Cardin-Catalan, et al. Non-Destructive robotic assessment of mango ripeness via multi-point soft haptics. In: 2019 International Conference on Robotics and Automation (ICRA) IEEE; 2019; pp. 1821–1826.

35. Sun, Wu, Zhao, et al. Object recognition and grasping for collaborative robots based on vision. Sensors, 2023; 24(1):195.

36. Cretu A-M, Payeur, Petriu. Soft object deformation monitoring and learning for model-based robotic hand manipulation. IEEE Trans Syst Man Cybern B Cybern, 2012; 42(3):740–753.

37. Scimeca, Hughes, Maiolino, et al. Model-free soft-structure reconstruction for proprioception using tactile arrays. IEEE Robot Autom Lett, 2019; 4(3):2479–2484.

38. Pastor, Gandarias, García-Cerezo, et al. Using 3D convolutional neural networks for tactile object recognition with robotic palpation. Sensors, 2019; 19(24):5356.

39. Bottcher, Machado, Lama, et al. Object recognition for robotics from tactile time series data utilising different neural network architectures. In: 2021 International Joint Conference on Neural Networks (IJCNN) IEEE; 2021; pp. 1–8.

40. Scimeca, Maiolino, Iida. Efficient bayesian exploration for soft morphology-action co-optimization. In: 2020 3rd IEEE International Conference on Soft Robotics (RoboSoft) IEEE; 2020; pp. 639–644.

41. Ward-Cherrier, Pestell, Cramphorn, et al. The tactip family: Soft optical tactile sensors with 3d-printed biomimetic morphologies. Soft Robot, 2018; 5(2):216–227.

42. Automation24. Development Kit Pmd[Vision]® CamBoard Pico Flexx - 700.000.094. 2021. Available from: https://www.automation24.com/development-kit-pmd-vision-r-camboard-pico-flexx-700-000-094 [Last accessed: November 8, 2024].

43. Krizhevsky, Sutskever, Hinton. ImageNet classification with deep convolutional neural networks. Commun ACM, 2017; 60(6):84–90.

44. ROS Components. Soft Gripper. n.d. Available from: https://www.roscomponents.com/en/grippers/soft-gripper-on-robot [Last accessed: November 16, 2024].

45. Meng, Gerez, Chapman, et al. A tendon-driven, preloaded, pneumatically actuated, soft robotic gripper with a telescopic palm. In: 2020 3rd IEEE International Conference on Soft Robotics (RoboSoft) IEEE; 2020; pp. 476–481.

46. Huang, Liu, Van Der Maaten, et al. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017; pp. 4700–4708.

47. Szegedy, Vanhoucke, Ioffe, et al. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016; pp. 2818–2826.

48. Xie, Girshick, Dollár, et al. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017; pp. 1492–1500.

49. Alexey. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv, 2020; preprint arXiv:2010.11929.

50. He, Zhang, Ren, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016; pp. 770–778.
