Abstract
Robotic grasping in dynamic environments is still one of the main challenges in automation tasks. Advances in deep learning methods and computational power suggest that the problem of robotic grasping can be solved by using a huge amount of training data and deep networks. Despite these huge accomplishments, the acceptance and usage in real-world scenarios is still limited. This is mainly due to the fact that the collection of the training data is expensive, and that the trained network is a black box. While the collection of the training data can sometimes be facilitated by carrying it out in simulation, the trained networks, however, remain a black box. In this study, a three-step model is presented that profits both from the advantages of using a simulation approach and deep neural networks to identify and evaluate grasp points. In addition, it even offers an explanation for failed grasp attempts. The first step is to find all grasp points where the gripper can be lowered onto the table without colliding with the object. The second step is to determine, for the grasp points and gripper parameters from the first step, how the object moves while the gripper is closed. Finally, in the third step, for all grasp points from the second step, it is predicted whether the object slips out of the gripper during lifting. By this simplification, it is possible to understand for each grasp point why it is stable and – just as important – why others are unstable or not feasible. All of the models employed in each of the three steps and the resulting Overall Model are evaluated. The predicted grasp points from the Overall Model are compared to the grasp points determined analytically by a force-closure algorithm, to validate the stability of the predicted grasps.
Introduction
Robots have already been successfully established in many areas of industry. Mobile robots are used extensively in the logistics of big warehouses, and cooperating robots are used as adaptive assistance systems for human workers. However, for the picking of unsorted product parts, no fully automated solution is available on a large scale yet, since the variability of the working environment and the (delicate) motor interaction therein is subject to higher fluctuations than in the aforementioned areas. Furthermore, robotic grasping involves several individual tasks like object classification, object segmentation, pose estimation and path planning, which are huge research tasks on their own. The intense use of machine learning methods, and especially of deep learning for image processing, has boosted the progress in the domain of robotic grasping. In the following, the subtasks, the applied state-of-the-art machine learning methods and their shortcomings are introduced.
State of the art
Robotic grasping is the task of finding poses of a robotic end-effector, such that any given object can be picked up successfully [1]. These poses are called grasping poses/postures. Vision-based robotic grasping uses data from vision sensors for the identification of grasp postures. In vision-based robotic grasping, the first task is to process the sensory data like RGB images, depth images, stereo camera images or 3D-point clouds to determine the shape(s) and pose(s) of the perceived object(s). A review on deep learning based object detection can be found in Zhao et al. and Liu et al. [2, 3, 4]. The second step is to determine, from the information obtained from the sensory data, the desired pose of the robot arm with respect to the object such that the object is stable between the gripper tips. The stability of a grasp can be measured by different metrics as compared in Rubert et al. [5]. The third step is to determine a collision-free path from the actual to the desired pose of the robot arm [6, 7]. In each of these steps, an increased usage of machine learning algorithms and especially deep learning with convolutional layers can be observed. An extensive review of deep learning methods used for robotic grasp detection can be found in [1]. Mainly two approaches are distinguished in robotic grasping [1, 8]: The first approach determines from the sensory image(s) the optimal grasp pose of the robot arm either analytically with force and form-closure objectives [9] or data-driven. A grasp planner determines subsequently the motor commands to drive the robot arm into the optimal grasp pose. In the second approach, the visuomotor controller is directly learned from the sensory image(s) as one monolithic model.
In the first approach, machine learning algorithms are primarily used for the detection of objects in images and for assessing grasp configurations. As an example, Jiang et al. [10] learned a rectangle representation of a grasp from hand-engineered features on RGB images, where the quality of the grasp was assessed with a support vector machine. Lenz et al. [11] have extended this approach and substituted the hand-engineered features as well as the quality determination of the grasp rectangles with deep learning models. Ogas et al. [12] have used a convolutional neural network to detect a given production piece on images of a monocular camera. For the determination of stable grasp points on the detected object shape, they used the Hough transformation [13] and applied force closure to the extracted line segments.
State-of-the-art grasping methods belong to the second approach and apply deep learning [14, 15, 16, 17] to train a visuomotor controller. Morrison et al. [16] use a convolutional neural network to predict the gripper width, gripper angle and grasp quality for each pixel of a single depth image. Levine et al. [15] trained a deep neural network that predicts the grasp success probability from camera images and a robot motor command. An optimization algorithm is applied to determine the motor command that maximizes the grasp success probability. In order to train this network, 800 000 training data points had to be recorded. As with most models that involve deep networks, the amount of data needed to train the network successfully is immense. Alternative approaches use a physics simulation of a robot arm and the objects to obtain the training data. The reality gap, which simply put is the difference between the simulation and the real world, is overcome by the principle of domain randomization [18]. The prevalent datasets that are used to train and evaluate convolutional neural networks for robotic grasp models are the Cornell [11] and Jacquard [19] datasets. Both datasets consist of RGB-D images of an object and grasp rectangles which represent the possible projection areas for a parallel-jaw gripper. The Jacquard dataset is completely generated in a simulation.
Problem statement and proposal
One problem that all of the state-of-the-art approaches share is that the evaluation of a robotic grasp is only given in terms of grasp success probabilities. More precisely, these methods learn a mapping which assigns to each pixel on an object image a probability for a successful grasp. This assignment is done via a CNN, for which it is not clear which human-readable features lead to the prediction. This problem is known under the term “black box” and the solution strategies are summarized under the term explainable AI (XAI) [20, 21, 22]. As the decisions made by the neural networks are difficult to comprehend, their application in autonomous industrial environments is hardly possible.
Flowchart of the proposed three-step model.
The contribution of this study is a three-step model to identify and evaluate grasp points. To this end, the problem of robotic grasping is reformulated as finding the grasp postures on an image where the gripper can be lowered, closed and lifted. In contrast to the direct mapping from an image to a grasp success probability, the user of the model can identify the reasons for failed grasp attempts and, in addition, assess the quality of the predictions in each individual step. This assessment is possible since the user can at least partly evaluate the results in the visual domain based on their own experience. When the results coincide with the user's expectations, the user's trust in the model increases. Furthermore, the proposed three-step model profits from the advantages of both a simulation approach and deep neural networks. Figure 1 shows a flowchart of the proposed model. In a first step, a predefined number of random grasp candidates is generated on an object image. All grasp points where the gripper can be lowered onto the table while enclosing an object part are determined by the Lowering Model. For this purpose, a neural network (NN) is trained to generate an image of the gripper for parameters
Another advantage of this new approach over existing methods is that the dynamics between the gripper and the object are taken directly into account in the Closing Model. To the best of our knowledge, there are no grasping models where the object movement due to the closing movement of the gripper is directly predicted. There is research on predicting object movements in pushing tasks directly in the image domain [23, 24, 25], but not in the realm of grasping. In [26, 27] the effect of a force acting on an object in an image is predicted in terms of a vector field but without the prediction in the image domain.
In the following, first the materials and methods, i.e. the underlying robot simulation and used object database, are described. Then a description of the algorithm for the determination of stable grasp points is given. Afterwards, the training and evaluation of the Lowering, Closing and Lifting Model are explained. Finally, the performance of the chained Overall Model is presented in detail, and the predicted grasp points from the Overall Model are compared to the grasp points from the analytical force closure algorithm to show and verify that the determined grasp points are stable grasp points.
Robot simulation and object database
The gripper under consideration is a parallel-jaw gripper with the gripper opening width
Left: V-REP simulation of the gripper in pre-grasp position. The position and orientation of the world coordinate system is indicated by the grey circle and the coordinate axes. The orthographic gripper camera and the orthographic ceiling camera, centred above the table are shown as grey boxes. Right: Image from the orthographic gripper camera in the pre-grasp position. Adapted from [28].
Parallel-jaw gripper with gripper parameters 
A grasp is executed perpendicular to the table surface at the world coordinates
DetermineStableGrasps.
In this section the proposed algorithm is described. The pseudo-code is given in Algorithm 3. This algorithm is used to determine stable grasp postures (
Thus, there are three possible causes why a chosen grasp candidate does not lead to a stable grasp. The first reason is that there is no gripper parameter configuration where the gripper can be lowered onto the table without colliding with the object. This possibly means that the object is too big for the used gripper at the grasp location. The second reason is that for the given grasp candidate, the object was pushed out of the gripper during closing due to the local shape of the object. Finally, the last reason for a failed grasp is that the object falls out of the gripper during lifting, because of the distance between the grasp candidate and the centre of mass of the object. The independent evaluation of these three steps allows the attribution of grasping failures to distinct causes in contrast to competing holistic approaches.
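The attribution of a failed grasp to one of the three causes can be sketched as a short chain of checks. This is only an illustration of the control flow: the callables `lower`, `close` and `lift` are hypothetical stand-ins for the Lowering, Closing and Lifting Model, and their signatures are assumptions made for this sketch.

```python
from enum import Enum


class Outcome(Enum):
    STABLE = "stable grasp"
    REJECTED_LOWERING = "no collision-free gripper configuration"
    REJECTED_CLOSING = "object pushed out of the gripper during closing"
    REJECTED_LIFTING = "object falls out of the gripper during lifting"


def evaluate_candidate(candidate, lower, close, lift):
    """Run the three steps in order and report the first step that fails.

    lower(candidate)          -> gripper parameters, or None if every
                                 configuration collides with the object
    close(candidate, params)  -> predicted image after closing, or None
                                 if the object is pushed out
    lift(closed_image)        -> True if the object stays in the gripper
    (all three are hypothetical placeholders for the trained models)
    """
    params = lower(candidate)
    if params is None:
        return Outcome.REJECTED_LOWERING
    closed_image = close(candidate, params)
    if closed_image is None:
        return Outcome.REJECTED_CLOSING
    if not lift(closed_image):
        return Outcome.REJECTED_LIFTING
    return Outcome.STABLE
```

Because each candidate leaves the chain at the first failing step, every rejection carries exactly one cause.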
Figure 19 shows a sample workflow for one object image. The red dots on the object indicate the randomly chosen grasp candidates, the green dots the grasp candidates that have been translated by
1. translate the object on the orthographic camera image such that the grasp candidate (translated by
2. rotate the object around the centre by
3. predict the gripper projection for (
4. scale the image by a factor of 5.3 to make the transformation from the ceiling camera to the gripper camera
5. centre crop a region of 250
6. separate the image into the gripper and object channels by pixel intensities
The last step yields the input images of the Closing Model.
In the following sections, the implementation of the Lowering, Closing and Lifting Model are described in detail, before the Overall Model is evaluated in an end-to-end fashion.
The first step of grasping is to lower the gripper tips onto the table without colliding with the object. For this purpose, a binary orthographic image is taken from the ceiling camera. In this binary image, grasp candidates are chosen randomly on the object shape. Since the size of the objects is restricted, the maximum number of grasp candidates is limited to 30. Furthermore, to guarantee an even distribution of the points, the grasp candidates are forced to have a distance of at least 5 pixels, which corresponds to 1 cm. For each grasp candidate an image patch is cropped with a size of 51
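The random choice of grasp candidates with a minimum spacing can be sketched as simple rejection sampling over the object pixels. This is a minimal sketch, not the procedure used in the study: the function name, the nested-list mask representation and the seed parameter are assumptions, while the limits of 30 candidates and 5 pixels are taken from the text.

```python
import math
import random


def sample_grasp_candidates(mask, max_candidates=30, min_dist=5.0, seed=0):
    """Pick random object pixels that are at least `min_dist` pixels apart.

    mask is a binary image given as a list of rows (1 = object, 0 = table).
    """
    rng = random.Random(seed)
    object_pixels = [(r, c)
                     for r, row in enumerate(mask)
                     for c, value in enumerate(row) if value]
    rng.shuffle(object_pixels)  # random visiting order
    chosen = []
    for pixel in object_pixels:
        # Accept the pixel only if it keeps the minimum distance to all others.
        if all(math.dist(pixel, other) >= min_dist for other in chosen):
            chosen.append(pixel)
            if len(chosen) == max_candidates:
                break
    return chosen
```

For small objects the loop may end with fewer than 30 candidates, simply because no further pixel satisfies the spacing constraint.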
Structure of the extreme learning machine for the gripper image prediction.
An extreme learning machine (ELM) is a single-hidden-layer feedforward neural network. Because there is only one hidden layer and the weights between the input and the hidden layer are kept fixed, the network training can be formulated as a simple matrix problem, in contrast to the backpropagation algorithm for networks with more than one hidden layer. This makes training extremely fast. For a detailed description, see the original work of Huang [31]. In the case of the gripper image prediction, the input of the ELM is the gripper parameter tuple (
which is the number of wrongly predicted pixels. The overall error for one validation set is then the average distance over the total number of patterns (S)
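The training principle of an ELM and the pixel-count error can be sketched as follows. This is an illustrative sketch under stated assumptions, not the network used in the study: the tanh activation, the normal initialisation, the number of hidden units and the threshold of 0.5 are all assumptions, and the class and function names are hypothetical.

```python
import numpy as np


class ELM:
    """Single-hidden-layer network with fixed random input weights."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # Input weights and biases are drawn once and never trained.
        self.W = rng.normal(size=(n_inputs, n_hidden))
        self.b = rng.normal(size=n_hidden)
        self.beta = None  # output weights, set by fit()

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, T):
        # Least-squares solution of H @ beta = T via the pseudo-inverse:
        # this single matrix operation replaces iterative backpropagation.
        self.beta = np.linalg.pinv(self._hidden(X)) @ T
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta


def pixel_error(pred, target, threshold=0.5):
    """Number of pixels that differ after thresholding both images."""
    return int(np.sum((pred >= threshold) != (target >= threshold)))
```

Here each row of `X` would hold one gripper parameter tuple and each row of `T` the flattened target image; averaging `pixel_error` over all patterns of a validation set gives the overall error described above.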
The performance of the ELM is also tested on novel data. Since
Gripper image of maximal error (26px) in test patterns for 
Schematic view of the optimization problem of lowering the gripper.
After training the ELM and obtaining the image patches for an object with the grasp candidate at the centre, the task of finding the collision-free gripper parameters can be formulated as an optimization problem, as depicted in Fig. 6: for each image patch with the grasp candidate at the centre, find the parameters (
Top: Illustration of the effect of the translation 
Example of the approximation of the Pareto front by GDE3, NSGA-II and NSGA-III for the shown image patch. The image of the gripper tips determined from the last population of GDE3 is shown as an example.
In summary, the Lowering Model is a bi-objective optimization problem with the two objective functions
In the following, three popular Pareto multiobjective optimization algorithms [33, 34] are compared in order to determine the solutions of the Lowering Model: generalized differential evolution 3 (GDE3) [35], non-dominated sorting genetic algorithm II (NSGA-II) [36] and non-dominated sorting genetic algorithm III (NSGA-III) [37]. Starting from an initial generation of vectors in the domain of the problem, the goal of all algorithms is to iteratively find the generation members that approximate the Pareto front [38] in an evenly distributed fashion. The Pareto front is the unknown set of solutions of the optimization problem where one objective function can only be improved by worsening the other objective [33]. The compared algorithms can be separated into two groups: while the goal of GDE3 and NSGA-II is to approximate the Pareto front [33] in a uniform fashion, NSGA-III uses predefined reference vectors as additional information in order to focus on the most interesting part of the approximated Pareto front. The Matlab toolbox PlatEMO [39] and its implementations of these algorithms were used to obtain the presented results. The parameter configurations of all algorithms are given by: population size
1. from the last population obtain the solutions with
2. from the solutions of step 1, find the non-dominated solution with minimal
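The notion of a non-dominated solution used in the two steps above can be sketched for a minimisation problem as follows. The validity condition of step 1 and the quantity minimised in step 2 are not reproduced here; in this sketch they are passed in as the hypothetical arguments `is_valid` and `key`, and the function names are assumptions as well.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimised)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated(points):
    """Members of `points` that no other member dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def pick_solution(points, is_valid, key):
    """Step 1: keep the valid solutions. Step 2: among their
    non-dominated members, return the one that minimises `key`."""
    front = non_dominated([p for p in points if is_valid(p)])
    return min(front, key=key) if front else None
```

For example, with objective vectors `[(0, 3), (1, 1), (2, 2), (0, 5), (3, 0)]`, the non-dominated set is `[(0, 3), (1, 1), (3, 0)]`, since `(2, 2)` is dominated by `(1, 1)` and `(0, 5)` by `(0, 3)`.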
Confusion matrices of the optimizers for 343 image patches
Confusion matrices of the optimizers for 343 noisy image patches
This procedure also ensures that the minimal possible value for the gripper opening width
To generate ground truth data for the assessment of the prediction quality of the Lowering Model, a systematic search is carried out. For each image patch, all parameter configurations of the gripper within a dense search grid are tested for whether they lead to a collision.
The confusion matrices of the optimizers (obtained after two runs for each optimizer) on all 343 image patches are shown in Table 1. All confusion matrices are assessed with Matthews correlation coefficient (MCC) [40]. All three optimizers perform very well, with
Another important point for assessing the results is the No Free Lunch Theorem [45], which states that all optimizers perform equally well when averaged over all possible problems. Consequently, there needs to be a specific reason why an optimizer performs especially well on a given problem [46]. In the optimization problem discussed here, a solution is only valid when one objective function is zero (
Object shapes distorted by noise
In order to determine the impact of noise on the performance of the optimizers, noisy image patches are generated. The noise generation process is intended to simulate errors in the identification of the object borders during image segmentation, e.g. caused by changing lighting conditions, shadows or partly reflecting surfaces. For this purpose, dilation, erosion and closing operations [47] are used to morph the original object shape. This leads to distorted object boundaries: the object appears thinner or slightly thicker than the original object. For the dilation, erosion and closing operations, a 3
1. at each pixel position of the original image patch perform a dilation or erosion with a probability of 0.5
2. perform a closing operation on the image patch obtained from step 1
3. perform an erosion on the image patch obtained from step 2 to further deform the object
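The three morphing steps can be sketched in plain Python for a binary image patch. This is one possible reading of the procedure and not the implementation used in the study: the 3-by-3 square neighbourhood, the treatment of pixels outside the patch as background, and the interpretation of step 1 as choosing per pixel between the dilated and the eroded value are assumptions.

```python
import random


def _window(img, r, c):
    """Values in the 3x3 window around (r, c); pixels outside the
    image count as background (0)."""
    h, w = len(img), len(img[0])
    return [img[i][j] if 0 <= i < h and 0 <= j < w else 0
            for i in range(r - 1, r + 2) for j in range(c - 1, c + 2)]


def dilate(img):
    return [[max(_window(img, r, c)) for c in range(len(img[0]))]
            for r in range(len(img))]


def erode(img):
    return [[min(_window(img, r, c)) for c in range(len(img[0]))]
            for r in range(len(img))]


def distort(img, seed=0):
    """Return a morphed copy of a binary image given as a list of rows."""
    rng = random.Random(seed)
    grown, shrunk = dilate(img), erode(img)
    # Step 1: per pixel, take the dilated or the eroded value (p = 0.5 each).
    mixed = [[grown[r][c] if rng.random() < 0.5 else shrunk[r][c]
              for c in range(len(img[0]))] for r in range(len(img))]
    # Step 2: closing = dilation followed by erosion.
    closed = erode(dilate(mixed))
    # Step 3: a final erosion deforms the object further.
    return erode(closed)
```

The random mixing in step 1 roughens the boundary, the closing fills the small holes this creates, and the final erosion shrinks the result again.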
Example of a morphed image.
An example of the effect, obtained by this procedure, is depicted in Fig. 9. The confusion matrices of the optimizers on the noisy data are shown in Table 2. As can be seen in comparison to the results on the undistorted image patches in Table 1, on average 23 image patches are false positives. The reason for this classification result is that at object regions, where the object was made thinner through image morphing, the optimizer finds a solution that is not valid. This is also the explanation why the false negative results disappear. The resulting MCC values are:
The overall computational complexity of the Lowering Model in the inference step is determined by the computational complexity of the ELM and the computational complexity of the NSGA-II optimizer (which is chosen here because it shows the best performance). Since an ELM consists of two layers (not counting the input), the complexity is given by
Training data generation for the Closing and Lifting Model
The positive grasp postures (
The orthographic gripper camera used for generating the data for the Closing and Lifting Model is fixed to the gripper. Therefore, the gripper has the 0-orientation in all image sequences. The resolution of the camera is 1024
a. Example of a grasp sequence in V-REP for a negative grasp candidate. b. Example of a grasp sequence in V-REP for a positive grasp candidate. c. Example of an image sequence obtained in simulation. First published in [28].

The goal of the Closing Model is to predict the physical interaction between object and gripper in the image domain while closing the gripper. The input of the Closing Model is an image of the object and gripper after lowering the gripper. The output is the image after closing the gripper and the resulting gripper opening width
Confusion matrices of the Closing and Lifting Model
Confusion matrix of the Overall Model – individual results
The method and the evaluation are described in more detail in the following: The CNN model is trained with 15 000 positive and 15 000 negative image pairs and evaluated on 2500 positive and 2500 negative image pairs. The results are evaluated with respect to two measures. The first measure is the Intersection over Union (IOU) [51] value, which determines how well the object and gripper movements are learned. The second measure is the predicted gripper opening width in terms of a confusion matrix: if the gripper was not closed in both prediction and physics simulation, the result is a true positive; if the gripper was closed in both prediction and physics simulation, the result is a true negative. False positives and false negatives are defined analogously. This measure determines how well the grasp candidates are classified. However, since the resulting image of the Closing Model is a necessary input for the Lifting Model, the IOU values are more important in this analysis. The resulting confusion matrix values for 2500 positive and 2500 negative image pairs (test data) are given in Table 3 (left), yielding a Matthews correlation coefficient of 0.945. The IOU value is 0.982 (sum over gripper and object channel). This means that both the prediction of the gripper and object movement and the classification results of the Closing Model are good. Figure 12 shows an example of a true negative (top row) and a true positive (bottom row) grasp candidate.
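Both evaluation measures can be written down in a few lines. The sketch below uses the standard definitions of the Intersection over Union for binary masks and of the Matthews correlation coefficient; the function names, the nested-list mask format and the conventions for empty masks and a zero denominator are assumptions of this sketch.

```python
import math


def iou(pred, truth):
    """Intersection over Union of two binary masks (lists of rows)."""
    pairs = [(p, t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    intersection = sum(1 for p, t in pairs if p and t)
    union = sum(1 for p, t in pairs if p or t)
    # Convention chosen here: two empty masks agree perfectly.
    return intersection / union if union else 1.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient of a 2x2 confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention chosen here: return 0 if a row or column sum is zero.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0
```

The MCC ranges from -1 (every prediction wrong) over 0 (no better than chance) to 1 (every prediction correct), which makes it a suitable single number for the balanced test sets used here.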
The computational complexity of an inference step in a full CNN, as used in the Closing Model, is given by [2]:
where
The task of the Lifting Model is to predict, for a two-channel object/gripper image after closing the gripper, whether the object falls out of the gripper or remains between the tips when the gripper is lifted. Since the object image after lifting is not needed for further processing steps, the Lifting Model is treated as a classification task. This task is learned with a CNN since this is the state-of-the-art method for image classification tasks. The input is a two-channel image and the output is computed through a sigmoid function. The output is thresholded for evaluation purposes to either 0 or 1. A standard CNN (Fig. 13) is used for classification in order to predict if the object falls out of the gripper during lifting. The training data are again obtained from the V-REP simulation. The network is trained with 5000 positive and 5000 negative images and tested on 1000 positive and 1000 negative images. The resulting confusion matrix entries on the test data are given in Table 3 (right). The Matthews correlation coefficient of the confusion matrix is
Examples for the prediction of the Closing Model. Left column: input. Center column: prediction. Right column: ground truth.

Examples for positively (top row) and negatively (bottom row) classified images by the CNN of the Lifting Model.
The computational complexity of the inference step for the CNN in the Lifting Model is mainly determined by Eq. (3). In analogy to the Closing Model, we argue here that the problem-specific complexity corresponds to
Evaluation and validation of the Overall Model
Performance evaluation
The evaluation of the performance of the Overall Model is done with respect to two measures. The first measure is the confusion matrix derived from the individual confusion matrices after each of the three evaluation steps. Figure 15 shows the graph which is used for the determination of the overall matrix. After the first step, the Lowering Model, the entries of its confusion matrix are obtained, denoted by
Graph for the determination of the confusion matrix for the Overall Model.
They are chosen such that in each step of the model half of the candidates are positive and half negative. Table 4 shows the obtained confusion matrix: The overall true positive results are the grasp candidates that are correctly classified as positive by the Lowering, Closing and Lifting Model (
The Overall Model is also used for the interpretation of failed grasps. Therefore, the number of grasp candidates that are rejected for the wrong reason and the number of grasp candidates that are rejected for the correct reason are determined: The number of grasp candidates that have been rejected for the wrong reason is given by the sum of the false negative values for each prediction step
This means that only 7.4% of the rejected grasp candidates are rejected for the wrong reason. Thus, the interpretation of failed grasps is correct and reliable in 92.6% of the cases.
Confusion matrix of the Overall Model – overall result
Example of two contact points 
In order to validate the stability of the predicted points, the grasp points determined by the Overall Model are compared to the grasp points determined by force closure [52]. Since data from a physics simulation have been used for the training of the individual models, the comparison with force closure also validates V-REP as a simulation tool. The force closure algorithm used here for comparison is an analytical method for finding force closure grasps of a 2D rigid object with two hard-finger point contacts with friction. When the gripper tips are in contact with the object, they exert forces on the object. These forces are called inner forces. When the inner forces can balance external forces and torques applied to the object, the object is said to be in force closure ([53], p. 223). The normal force
where
Application of the force closure algorithm to an orthographic object image. The object shape is indicated by the dashed grey line.
1. parametrize the boundaries of the object on the orthographic camera image with B-splines [55].
2. compute the tangent and normal vectors for every third pixel-sized point on the parametrized boundaries, with
3. for all pairs of points on the boundaries, check if the force closure inequalities (4.2) are fulfilled, and save the pairs of boundary pixels that fulfil the inequalities as contact points being in force closure.
4. eliminate all contact point pairs where the distance is greater than
5. for the remaining pairs of contact points, the midpoints of the joining lines of each pair are the grasp points determined by force closure.
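The pairwise check in the steps above can be illustrated with the classical antipodal condition for two hard-finger point contacts with friction in the plane: the grasp is in force closure if the line joining the two contact points lies strictly inside both friction cones. The sketch below uses this textbook condition as a stand-in for the inequalities (4.2), which are not reproduced here; the function names and the use of inward-pointing normals are assumptions.

```python
def in_friction_cone(direction, inward_normal, mu):
    """True if `direction` lies strictly inside the friction cone with
    friction coefficient `mu` around `inward_normal` (2D vectors)."""
    dot = direction[0] * inward_normal[0] + direction[1] * inward_normal[1]
    cross = direction[0] * inward_normal[1] - direction[1] * inward_normal[0]
    # tan(angle) = |cross| / dot, so |angle| < atan(mu) iff |cross| < mu * dot.
    return dot > 0 and abs(cross) < mu * dot


def is_force_closure(p1, n1, p2, n2, mu):
    """Antipodal test for contacts p1, p2 with inward normals n1, n2."""
    line = (p2[0] - p1[0], p2[1] - p1[1])
    back = (-line[0], -line[1])
    return in_friction_cone(line, n1, mu) and in_friction_cone(back, n2, mu)


def grasp_point(p1, p2):
    """Midpoint of the line joining the two contact points."""
    return ((p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2)
```

Two contacts on opposite sides of a bar, facing each other, pass the test for any positive friction coefficient, whereas strongly offset contacts fail because the joining line leaves the friction cones.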
An example of the enumerated steps is given in Fig. 17. The B-spline parametrization of the boundary is indicated by the grey dashed line, and the tangent and normal vectors by the black and green arrows, respectively. The determined grasp points are indicated by the grey dots. Since the force closure approach assumes point contact, but the three-step model uses rectangular gripper tips, the grasp points cannot be compared directly. Instead, it is verified that the grasp points from the Overall Model are a subset of the grasp points computed by force closure. In order to determine the final grasp points from the Overall Model, the object translation and rotation predicted by the Closing Model have to be determined. For this purpose, three image properties are determined for each thresholded object channel before and after closing: the centre of mass and the major and minor axes. From these three properties the translated grasp candidate is determined.
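One common way to obtain the centre of mass and the axis orientation of a binary object image is via image moments. The sketch below uses that standard technique as an assumption about how such properties can be computed; it is not claimed to be the routine used in the study, and the function names and the coordinate convention (x = column, y = row) are hypothetical.

```python
import math


def centroid_and_orientation(mask):
    """Centre of mass (x, y) and major-axis angle of a binary mask.

    mask is a list of rows; the angle is measured from the x axis and
    is only defined up to a multiple of pi.
    """
    pixels = [(c, r) for r, row in enumerate(mask)
              for c, value in enumerate(row) if value]
    if not pixels:
        raise ValueError("mask contains no object pixels")
    n = len(pixels)
    cx = sum(x for x, _ in pixels) / n
    cy = sum(y for _, y in pixels) / n
    # Second-order central moments of the pixel distribution.
    mu20 = sum((x - cx) ** 2 for x, _ in pixels) / n
    mu02 = sum((y - cy) ** 2 for _, y in pixels) / n
    mu11 = sum((x - cx) * (y - cy) for x, y in pixels) / n
    theta = 0.5 * math.atan2(2 * mu11, mu20 - mu02)
    return (cx, cy), theta


def object_motion(before, after):
    """Translation (dx, dy) and rotation of the object between two masks."""
    (x0, y0), t0 = centroid_and_orientation(before)
    (x1, y1), t1 = centroid_and_orientation(after)
    return (x1 - x0, y1 - y0), t1 - t0
```

Applying the resulting translation and rotation to a grasp candidate from the Lowering Model then gives its position after the gripper has closed.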
Example for grasp points found by the Overall Model (left) and the force closure approach (right). The grasp points are indicated by the grey dots.
Schematic view of the end-to-end application of the three-step model.
The Overall Model is evaluated on 50 objects with 420 grasp candidates to determine the liftable grasp candidates. For these grasp candidates, the translation predicted by the Closing Model is determined and applied to the grasp candidates determined by the Lowering Model, yielding the predicted grasp points of the Overall Model. A predicted grasp point is said to be a stable grasp point if at least one grasp point from force closure is found within a distance of 5.5 px (corresponding to 0.5 cm). Of the 100 grasp candidates remaining after the Lifting Model, 89% agree with the grasp points determined by force closure, while 11% do not. These 11% correspond to the grasp candidates falsely classified as positive. This means that all true positive grasp candidates from the Overall Model have an equivalent in the force closure grasp points. Figure 18 shows an example of the grasp points found by the Overall Model in comparison to the force closure approach. As can be seen in this figure, force closure always finds more grasp points, since point contact is assumed. For the shown object, there is only one feasible grasp candidate that is found by the Overall Model.
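The matching criterion between predicted grasp points and force closure grasp points can be sketched as a nearest-neighbour check. The threshold of 5.5 px is taken from the text; the function names and the point format (pairs of pixel coordinates) are assumptions of this sketch.

```python
import math


def confirmed_by_force_closure(predicted, fc_points, max_dist=5.5):
    """True if at least one force closure grasp point lies within
    `max_dist` pixels of the predicted grasp point."""
    return any(math.dist(predicted, q) <= max_dist for q in fc_points)


def agreement_rate(predicted_points, fc_points, max_dist=5.5):
    """Fraction of predicted grasp points confirmed by force closure."""
    if not predicted_points:
        return 0.0
    hits = sum(confirmed_by_force_closure(p, fc_points, max_dist)
               for p in predicted_points)
    return hits / len(predicted_points)
```

With force closure points at `(10, 10)` and `(30, 30)`, a prediction at `(12, 13)` is confirmed (distance of about 3.6 px), while one at `(20, 20)` is not (about 14.1 px).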
The three-step model presented in this study was successfully trained to determine, from an orthographic camera image, grasp points where the object can be grasped. Failed grasp attempts can be explained directly through the outcome of each of the three steps. The first step, the Lowering Model, is realized as an optimization task with respect to a predicted gripper projection and an object image. In the second step, the Closing Model predicts the interaction between the gripper and the object while closing the gripper. In the third step, the Lifting Model classifies the predicted images of the Closing Model into liftable and non-liftable. In an end-to-end application, the input of the Overall Model is an image of the object, and the output is a set of stable grasp points and a set of rejected grasp candidates (plus the information at which step they were rejected). The MCC values of the submodels amount to 0.98, 0.84, and 0.88, respectively, and of the Overall Model to 0.84. In case of grasp-point rejection, the correct cause is determined in 92.6% of the cases.
Compared to the ML-based holistic approaches to robotic grasping which are mentioned in the introduction section [14, 15, 16, 17], the proposed three-step model allows for a causal attribution of why a specific grasp candidate is not suitable. This is an important advantage and makes such an approach potentially more useful for industrial applications. In our work, a successful implementation of this approach was shown for single flat objects on a workplate in combination with a robot arm with a two-finger gripper and two orthographic cameras, one of them positioned centrally between the gripper tips. To transfer this approach and setting to the real world, the main challenge is to reliably detect the object shape in the camera images. For this purpose, many ML-based algorithms for object shape recognition exist [3, 4].
As was shown for the Lowering Model, noise in the image acquisition process that distorts the object shapes influences the classification performance only to a minor extent. Since the object images are further downscaled for the Closing Model and Lifting Model, small distortions of the object boundaries will have an even smaller effect on the performance of these two models as well as on the performance of the Overall Model.
The large amounts of training data required for deep learning were generated in simulation instead of the real world. This has many obvious advantages like cost and time savings, but also pitfalls like the deviations between the simulation and physical reality (which we assessed via the successful validation with the force closure method). We want to point out an additional advantage of using simulations which comes into play in our study: In simulation, additional data can be acquired which is not easily available in a real-world setting. In the presented approach, this concerns the camera placed between the gripper tips to record the training data for the Closing Model even when the gripper is nearly or completely closed. This would not be possible in this straightforward way in the real world because of occlusions. Does this prevent the transfer to the real world? In training yes, but (and this is the crucial point) not during the inference phase in the application. In this phase, only the camera image before closing the gripper is required (as input to the Closing Model).
In the current version of the three-step model, the Closing Model and the Lifting Model are implemented by CNNs. This is a straightforward design decision since CNNs are state-of-the-art in the areas of image-to-image regression and image classification. However, especially in the field of classification, there are many competing machine learning algorithms which may be promising alternatives. Recent developments include, for example, neural dynamic classification algorithms [56], neural ensemble methods with only a small set of hyperparameters [57], and finite element machine classifiers where the whole training set is modeled as a probabilistic manifold [58]. In particular, neural dynamic classification achieves competitive results in image classification tasks (shown in [56] for the MNIST dataset [59]).
According to our analysis of the computational complexity of the submodels, the optimization approach for the Lowering Model has the biggest impact on the computational costs in the inference step, so it is desirable to reduce the computational effort for this model. This could be achieved by using interactive preference articulation for the parameters to be determined [60]. In contrast to NSGA-II/-III, this approach allows the region of interest on the Pareto front to be adapted during the optimization process.
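To illustrate what restricting the search to a region of interest on the Pareto front means, the following minimal sketch may be helpful. It is not the method of [60] and not our implementation: it only filters a finished set of candidate solutions after the fact, whereas interactive preference articulation steers the optimizer while it runs. All names (dominates, pareto_front, preferred_front) and the use of per-objective upper bounds as the preference are assumptions made for illustration, and all objectives are assumed to be minimized.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]  # one objective vector; every objective is minimized (assumption)


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates (quadratic-time sketch)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def preferred_front(points: List[Point], upper_bounds: Sequence[float]) -> List[Point]:
    """Hypothetical preference: keep Pareto-optimal points within the given per-objective bounds."""
    return [
        p for p in pareto_front(points)
        if all(x <= u for x, u in zip(p, upper_bounds))
    ]


if __name__ == "__main__":
    # Made-up objective vectors, not data from the study.
    candidates = [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0), (3.0, 4.0), (5.0, 5.0)]
    print(pareto_front(candidates))
    print(preferred_front(candidates, upper_bounds=(3.0, 4.0)))
```

In an interactive scheme, bounds of this kind would be updated during the run so that computational effort is spent only on the part of the front that is of interest.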
The next planned steps of our research are the following: (1) Adapt the model to a real-world setting in the application phase and adjust the preceding training process in simulation as necessary. (2) Work with generic object shapes in 3D, allowing for arbitrary objects: for this purpose, depth cameras have to be used, and the submodels have to be changed accordingly. The main challenge is the handling of uncertainties in the depth images due to changes in perspective and occlusions. For this reason, the Closing Model has to be implemented on the basis of artificial neural networks with probabilistic properties. (3) Work with full-color images instead of object shapes to recognize objects more specifically. In this way, the predictions of the trained Closing and Lifting Models could account for an uneven mass distribution and varying friction coefficients.
Thus, there are still open research questions before the proposed approach can be applied to arbitrary real-world settings. Nevertheless, the contribution of our paper is to put forward the three-step approach which allows for clear causal attributions why specific grasp candidates are not suitable, and why an object may be completely non-graspable. This is highly valuable information for any industrial application in which ML-based systems often are avoided as long as they are a complete “black box”. In contrast, we could demonstrate that our proposed Overall Model can be successfully trained and applied to a specific setting in robotic grasping and gives the user the promised insights.
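The way a cascade of three submodels yields a causal attribution can be sketched as follows. This is a minimal illustration under stated assumptions, not our implementation: the three callables stand in for the trained Lowering, Closing and Lifting Models, and the names GraspCandidate, Outcome and evaluate_grasp, as well as the candidate parameters, are hypothetical. The sketch also assumes that only the lowering and lifting steps can reject a candidate, while the closing step merely predicts the object state after the gripper has closed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable


class Outcome(Enum):
    STABLE = "stable grasp"
    COLLISION_ON_LOWERING = "gripper collides with the object while lowering"
    SLIPS_ON_LIFTING = "object slips out of the gripper while lifting"


@dataclass(frozen=True)
class GraspCandidate:
    # Hypothetical parameterization of a grasp point and gripper setting.
    x: float
    y: float
    angle: float
    opening: float


def evaluate_grasp(
    candidate: GraspCandidate,
    can_lower: Callable[[GraspCandidate], bool],   # stand-in for the Lowering Model
    predict_closed: Callable[[GraspCandidate], Any],  # stand-in for the Closing Model
    holds_on_lift: Callable[[Any], bool],          # stand-in for the Lifting Model
) -> Outcome:
    """Run the three steps in order and report the first step that rejects the candidate."""
    if not can_lower(candidate):
        return Outcome.COLLISION_ON_LOWERING
    closed_state = predict_closed(candidate)  # predicted object state after closing
    if not holds_on_lift(closed_state):
        return Outcome.SLIPS_ON_LIFTING
    return Outcome.STABLE


if __name__ == "__main__":
    candidate = GraspCandidate(x=0.0, y=0.0, angle=0.0, opening=0.05)
    # Toy stand-ins: lowering succeeds, closing leaves the state unchanged, lifting fails.
    result = evaluate_grasp(candidate, lambda c: True, lambda c: c, lambda s: False)
    print(result.value)
```

Because each rejection is tied to exactly one step, the returned outcome names the reason a candidate was discarded, which is the kind of insight a single end-to-end classifier does not provide.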
Conclusions
In this study, it was proposed to divide the task of robotic grasping into three simpler tasks that allow for the interpretation of failed grasp attempts. The models that have to be learned to accomplish these three tasks are the Lowering Model, the Closing Model and the Lifting Model. Overall, the evaluation of each individual model shows very good classification performance. The evaluation of the end-to-end application of the Overall Model shows that the causal attribution of failed grasps is reliable in 92% of the cases. Furthermore, the classification performance of the Overall Model is only slightly worse than the individual classification performances. Comparison with the force closure algorithm shows that all grasp points identified by the Overall Model as final true positives are stable with respect to external torques and wrenches. This is an important validation of the Overall Model and the underlying simulation approach using V-REP and the Bullet physics engine.
Footnotes
The grid for the systematic search is:
Acknowledgments
This work was supported by the EFRE-NRW funding programme “Forschungsinfrastrukturen” (grant no. 34.EFRE-0300119).
We are also thankful to Prof. Ralf Möller from Bielefeld University for many helpful discussions and his overall support.
