Abstract
With the advancement of intelligent manufacturing technology, there is an urgent demand for robots to replace humans in high-precision, heavy-duty, and other transportation tasks. In contrast to large-scale robotic arm manipulation, this work focuses on achieving suction-based grasping and transportation using an unmanned ground vehicle equipped with a small robotic arm. Compared with large manipulators, small robotic arms require higher precision and have lower fault tolerance. To improve exploration efficiency and sample utilization, this work proposes an enhanced multi-agent deep reinforcement learning method called curiosity-driven Multi-Agent Deep Deterministic Policy Gradient (CD-MADDPG). This method integrates a curiosity-driven prioritized experience replay mechanism into the multi-agent deep deterministic policy gradient framework, where the prediction residual of a forward dynamics model is used to quantify the novelty of samples, guiding the agent toward underexplored states. In addition, a decoupling strategy is adopted, where each joint of a single robotic arm is treated as an independent agent, thereby transforming the high-dimensional action space into a low-dimensional one, complemented by a designed global reward mechanism. Experimental results demonstrate that CD-MADDPG achieves a 19% improvement in success rate and a 44% improvement in distance accuracy compared to the non-decoupled single-agent counterpart, effectively accomplishing the suction-based grasping and transportation task.
Introduction
With the advent of Industry 4.0, intelligent manufacturing has emerged as a cornerstone of modern industrial transformation, driving the integration of advanced robotics to enhance flexibility, efficiency, and adaptability in production processes.1–3 Robotic manipulators, particularly those equipped with suction end-effectors, have become indispensable for tasks such as assembly, grasping, and material handling in unstructured environments. Suction-based operations offer advantages in handling delicate or irregular objects without mechanical damage, yet they demand precise real-time control of end-effector pose, including position, orientation, and approach height. As manufacturing shifts toward dynamic, cluttered workspaces, the demand for collaborative, adaptive control strategies has intensified, positioning multi-agent deep reinforcement learning (MADRL) as a promising paradigm for achieving robust robotic suction performance.
However, robotic suction grasping in unstructured and dynamic settings presents formidable challenges. The UR5 manipulator’s 6-degree-of-freedom (6-DoF) configuration results in a high-dimensional continuous action space, exacerbated by kinematic coupling among joints, which complicates exploration and leads to sample inefficiency. Traditional model-based controllers (e.g., model predictive control or impedance control) struggle with non-stationary targets and sparse rewards, while conventional deep reinforcement learning (DRL) approaches often fail to maintain stability under varying object positions and orientations.4,5 Moreover, single-agent DRL cannot effectively coordinate multiple joints for precise multi-metric alignment (horizontal distance, vertical height, and angular deviation), resulting in low success rates and slow convergence in sparse-reward environments.
Existing research has explored various solutions to robotic manipulation control. Trajectory planning techniques, such as hybrid A* and particle swarm optimization or free-space decomposition methods, have improved path smoothness but exhibit limited adaptability to dynamic obstacles and real-time pose requirements.6–8 Single-agent DRL algorithms, including DDPG, PPO, and TD3, have demonstrated success in simplified pick-and-place tasks yet suffer from the curse of dimensionality and poor exploration in high-DoF systems.9–11 Recent advances in multi-agent DRL have addressed collaboration by treating dual arms or finger groups as independent agents, achieving better coordination in assembly and grasping.12–15 Nevertheless, these methods typically rely on uniform experience replay, which neglects the novelty of under-explored states, and lack tailored reward designs for suction-specific metrics, leading to suboptimal precision and slow convergence in cluttered suction scenarios.
To overcome these limitations, this study proposes curiosity-driven multi-agent deep deterministic policy gradient (CD-MADDPG), an enhanced multi-agent deep deterministic policy gradient (MADDPG) framework that integrates a curiosity-driven prioritized experience replay (CD-PER) mechanism with joint decoupling. Each of the six joints of the UR5 arm is modeled as an independent agent, transforming the high-dimensional action space into six low-dimensional subspaces while preserving global coordination through centralized training and decentralized execution. A forward dynamics model quantifies sample novelty via prediction residuals, enabling prioritized replay of high-novelty experiences for efficient exploration.16 Furthermore, a composite reward function balances horizontal proximity, angular alignment, and height clearance, ensuring suction-specific precision. This approach not only mitigates non-stationarity in multi-agent environments but also significantly improves sample utilization and convergence speed.
This study investigates robotic suction grasping in dynamic environments. A robotic arm equipped with a suction end-effector serves as the actuator, where each joint is modeled as an individual agent through a decoupling control strategy. During the learning process, critical factors including end-effector orientation, distance to target, and approach height are incorporated. A MADRL framework is constructed to facilitate the training of the robotic arm.
The main contributions of this work are as follows: (1) A simulated environment for robotic suction is developed by decoupling joints as independent agents and incorporating randomized target positioning. (2) The CD-MADDPG algorithm is proposed, integrating curiosity-driven PER to optimize experience prioritization and enhance noise robustness. (3) A multi-metric reward function is introduced, balancing joint angles, target proximity, and height requirements.
The remainder of this paper is organized as follows. The “Related work” section reviews related work on DRL-based robotic control. The “System model” section establishes the UR5 kinematic model and experimental setup. The “CD-MADDPG-based control strategy for robotic manipulator” section details the CD-MADDPG algorithm, including Markov decision formulation, curiosity-driven replay, and reward design. The “Experimental design and analysis” section presents comprehensive simulation results and analysis. Finally, the “Conclusion” section concludes the paper and outlines future directions.
Related work
For the problem of trajectory planning and control of robotic manipulators in structured environments, previous studies have primarily relied on model-based approaches. Spahn et al. proposed a model predictive control framework combined with free-space decomposition to optimize coupled trajectories for mobile manipulators. Sadiq et al. integrated A* search with particle swarm optimization to generate smooth trajectories. Recent works have further adopted DMP-PSO frameworks to enhance joint-angle optimization and time-optimal trajectory planning.17,18 However, these model-based methods depend heavily on accurate environmental models and suffer from high computational costs and limited real-time adaptability when target positions and orientations change unpredictably in unstructured scenes. In contrast, the proposed CD-MADDPG employs a model-free MADRL paradigm that eliminates the need for explicit environment modeling, thereby enabling faster adaptation to dynamic suction tasks.
For the challenge of high-dimensional action spaces and sparse rewards in single-agent DRL for robotic grasping, existing methods have achieved notable progress in simplified pick-and-place tasks. Liu et al., Wang et al. (2026), and Hazem et al. applied DDPG, TD3, PPO, and SAC to learn end-to-end policies directly from raw states or image pixels, demonstrating strong sim-to-real transfer capabilities.19–21 To address the complexity of high-DoF systems, Tutsoy et al. 22 proposed reinforcement learning combined with symbolic inverse kinematics to generate reduced-order control signals for humanoid robots such as the NAO.
Hou et al. 23 further divided the robotic arm into multiple control segments and experimentally demonstrated that grasping performance improves significantly when the number of control segments exceeds four. Therefore, the proposed method decouples the six joints of the UR5 arm into independent agents, substantially reducing the action-space dimensionality while preserving global coordination and achieving faster convergence and higher precision than single-agent baselines.
For the coordination problem in multi-robot or multi-joint manipulation, multi-agent DRL frameworks have been introduced to enhance stability. Chen et al., 24 Low et al., 25 Aina et al., and Waseem et al. 26 treated dual arms or finger groups as independent agents under the centralized-training decentralized-execution paradigm, achieving improved performance in assembly and dexterous grasping tasks.
Although these studies mitigate non-stationarity, they still rely on uniform experience replay and lack suction-specific reward designs. In comparison, the proposed CD-MADDPG not only adopts joint decoupling but also integrates a CD-PER mechanism that prioritizes novel samples, thereby significantly enhancing exploration efficiency in sparse-reward suction scenarios—an advantage not found in existing multi-agent approaches.
For the critical issue of exploration inefficiency in model-free reinforcement learning, curiosity-driven mechanisms and prioritized experience replay (PER) have been widely investigated. Lanier et al. and Dong et al. 27 utilized forward dynamics models to quantify prediction residuals and prioritize novel transitions. Xue et al. 28 and Farooq et al. applied curiosity-enhanced methods to accelerate learning in manipulation tasks. 29 Tutsoy et al. 30 further proposed an exploration-exploitation-based adaptive law for intelligent model-free control, validated through real-time experiments. Despite these advances, few studies combine curiosity-driven prioritization with multi-agent joint decoupling, especially for suction grasping that requires simultaneous optimization of distance, height, and orientation. The proposed CD-MADDPG fills this gap by directly integrating curiosity-driven PER into the MADDPG framework, achieving higher sample utilization and faster convergence than prior curiosity-only or multi-agent-only methods.
In summary, although previous works have individually addressed trajectory planning, single-agent DRL, multi-agent coordination, and exploration mechanisms, critical gaps remain in handling high-dimensional joint coupling, multi-metric suction requirements, and efficient exploration in unstructured environments.31,32 The proposed CD-MADDPG overcomes these limitations through joint decoupling, curiosity-driven prioritized replay, and a composite reward function, achieving a success rate of 96% (19% higher than baseline algorithms) and converging to a minimum distance of 1.45 cm (44% better than baseline algorithms). It demonstrates clear advantages in precision, convergence speed, and robustness over all compared methods.
System model
Scenario and model
The UR5 robotic arm (Figure 1) possesses a serial linkage mechanism, with its functionality divided between an arm segment (joints 1–3) and a wrist segment (joints 4–6), which collectively enable versatile end-effector positioning.33 Its kinematic model is governed by the Denavit–Hartenberg (D-H) convention, which defines four parameters per link: the link length ($a_i$), the link twist ($\alpha_i$), the link offset ($d_i$), and the joint angle ($\theta_i$).

Structure diagram of UR5 manipulator.
Forward kinematics of manipulator
According to the standard D-H convention, the homogeneous transformation matrix of the
From the properties of the homogeneous transformation matrix of the robotic arm, it is known that the vectors

Denavit–Hartenberg parameter coordinate frame diagram.
Among these, the commonly used abbreviations in robotics are adopted:

Experimental scene diagram of a single robotic arm.
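The forward-kinematics chain described in this subsection can be sketched as follows. This is a minimal illustration of the standard D-H transform, not the authors' implementation: the function names `dh_transform` and `forward_kinematics` and the planar two-link parameter table `PLANAR_2R` are hypothetical and chosen only because the result is easy to verify by hand. For the UR5, the table would hold six rows taken from the manufacturer's D-H parameters.

```python
import numpy as np


def dh_transform(a, alpha, d, theta):
    """Homogeneous transform from frame i-1 to frame i (standard D-H)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    ct, st = np.cos(theta), np.sin(theta)
    return np.array([
        [ct, -st * ca, st * sa, a * ct],
        [st, ct * ca, -ct * sa, a * st],
        [0.0, sa, ca, d],
        [0.0, 0.0, 0.0, 1.0],
    ])


def forward_kinematics(dh_rows, joint_angles):
    """Chain the per-link transforms to get the end-effector pose in the base frame."""
    T = np.eye(4)
    for (a, alpha, d), theta in zip(dh_rows, joint_angles):
        T = T @ dh_transform(a, alpha, d, theta)
    return T


# Hypothetical planar two-link arm with unit link lengths: (a, alpha, d) per link.
PLANAR_2R = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

pose = forward_kinematics(PLANAR_2R, [0.0, 0.0])
position = pose[:3, 3]      # fully stretched arm: end-effector at (2, 0, 0)
rotation = pose[:3, :3]     # orientation part, used for the angular metric
```

The last column of the resulting matrix gives the end-effector position, and the upper-left 3×3 block gives its orientation, which is the information the distance, height, and angle metrics rely on.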
Experimental setup and evaluation metrics
As shown in Figures 3 and 4, the simulation environment consists of a UR5 robotic arm and a pink cylinder representing the object to be suctioned. Task completion is evaluated based on three criteria: horizontal distance, height, and angular deviation. The horizontal distance refers to the distance in the

The manipulator in action, shown from a front-facing viewpoint.

Convergence of success rates for various algorithms.
When
The height metric refers to the shortest distance between the suction cup of the robotic arm’s end-effector and the plane perpendicular to the contact surface of the target object. Let
Mathematical parameters.
CD-MADDPG-based control strategy for robotic manipulator
Markov decision analysis
In our multi-robot system, each manipulator operates as an autonomous agent within the shared environment. At every time step
State
Observation
Action space
To guide the manipulator in completing the suction operation, we design a composite reward function comprising horizontal distance, angular deviation, and height deviation. The specific definitions of each component are as follows.
This design encourages the end-effector to point downward (
This penalty term encourages the end-effector to remain within a height range near the target surface.
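A composite reward of the kind described above can be sketched as follows. The exact terms, weights, and thresholds are not recoverable from the text, so everything numeric here is an assumption: the weights `w_dist`, `w_angle`, `w_height` and the admissible `height_band` are hypothetical placeholders, and the function name `composite_reward` is illustrative. The sketch only shows how the three components (horizontal distance, angular deviation from a downward-pointing approach axis, and height outside a band above the target) might be combined into one global scalar shared by all joint agents.

```python
import numpy as np


def composite_reward(ee_pos, ee_axis, target_pos,
                     w_dist=1.0, w_angle=0.5, w_height=0.5,
                     height_band=(0.0, 0.05)):
    """Global reward shared by all joint agents (hypothetical weights and band)."""
    ee_pos = np.asarray(ee_pos, dtype=float)
    ee_axis = np.asarray(ee_axis, dtype=float)
    target_pos = np.asarray(target_pos, dtype=float)

    # Horizontal distance: offset between end-effector and target in the x-y plane.
    horizontal = np.linalg.norm(ee_pos[:2] - target_pos[:2])

    # Angular deviation: angle between the approach axis and the downward direction.
    axis = ee_axis / np.linalg.norm(ee_axis)
    cos_angle = np.clip(np.dot(axis, np.array([0.0, 0.0, -1.0])), -1.0, 1.0)
    angle = np.arccos(cos_angle)

    # Height penalty: zero inside the band above the target surface, linear outside.
    height = ee_pos[2] - target_pos[2]
    low, high = height_band
    height_violation = max(low - height, 0.0, height - high)

    return -(w_dist * horizontal + w_angle * angle + w_height * height_violation)
```

With these placeholder settings, a pose directly above the target, pointing straight down and inside the height band, receives the maximum reward of zero, and any deviation in one of the three metrics lowers it.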
Curiosity-driven experience replay mechanism
Standard MADDPG employs a uniform sampling strategy that overlooks the non-uniform distribution of learning value among experiences. While TD-error-based PER mitigates this, it often suffers from insufficient exploration in sparse-reward environments. To address this issue, we integrate a CD-PER mechanism into MADDPG.16 By leveraging the prediction residual of a forward dynamics model to quantify sample novelty, CD-PER assigns higher replay weights to transitions with larger errors, which indicate under-fitted dynamical features. This shift from value-based to cognition-based prioritization significantly enhances sample efficiency in complex tasks.
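The prioritization idea can be sketched as follows, under several assumptions. The forward dynamics model is replaced by a hypothetical linear predictor (`LinearForwardModel`), whereas the actual model is presumably a neural network. The exponent `alpha`, the offset `eps`, and the importance-sampling correction with `beta` are borrowed from standard prioritized experience replay and are not confirmed by the text. Only the core principle is taken from the description above: the replay priority of a transition grows with the forward model's prediction residual.

```python
import numpy as np


class LinearForwardModel:
    """Hypothetical stand-in for the forward dynamics model: s' ~ W [s; a]."""

    def __init__(self, state_dim, action_dim, lr=0.1):
        self.W = np.zeros((state_dim, state_dim + action_dim))
        self.lr = lr

    def residual(self, s, a, s_next):
        # Prediction residual, used as the novelty score of the transition.
        return float(np.linalg.norm(s_next - self.W @ np.concatenate([s, a])))

    def update(self, s, a, s_next):
        # One gradient step on the squared prediction error.
        x = np.concatenate([s, a])
        self.W += self.lr * np.outer(s_next - self.W @ x, x)


class CuriosityReplayBuffer:
    """Replay buffer whose sampling probability follows the novelty score."""

    def __init__(self, alpha=0.6, eps=1e-3, seed=0):
        self.data, self.priorities = [], []
        self.alpha, self.eps = alpha, eps
        self.rng = np.random.default_rng(seed)

    def add(self, transition, residual):
        self.data.append(transition)
        self.priorities.append((residual + self.eps) ** self.alpha)

    def probabilities(self):
        p = np.asarray(self.priorities)
        return p / p.sum()

    def sample(self, batch_size, beta=0.4):
        probs = self.probabilities()
        idx = self.rng.choice(len(self.data), size=batch_size, p=probs)
        # Importance-sampling weights (standard PER correction, assumed here).
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return idx, [self.data[i] for i in idx], weights
```

In a training loop, each new transition would be scored with `residual` before being stored, and the forward model would be updated on sampled batches so that frequently replayed dynamics gradually lose priority.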
DRL algorithm
This work introduces a MADDPG algorithm employing the Centralized Training with Decentralized Execution (CTDE) paradigm, designed to balance system-wide coordination with individual agent performance in multi-robot arm systems.
The policy gradient for each agent is formulated as:
Each agent’s value network is optimized via the loss function:
CD-PER-MADDPG Training Procedure.
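For reference, the actor gradient and critic loss of the standard MADDPG formulation are commonly written as below. This is a sketch in generic notation and may not match the symbols of the original equations: $\theta_i$ and $\phi_i$ denote the actor and critic parameters of agent $i$, $\mathbf{x}$ the global state information, $o_i$ the local observation, $\mathcal{D}$ the replay buffer, and primed quantities the target networks and next-step values. With prioritized replay, each squared error is typically scaled by a per-sample weight, which is omitted here.

```latex
\nabla_{\theta_i} J(\mu_i)
  = \mathbb{E}_{\mathbf{x}, a \sim \mathcal{D}}
    \left[ \nabla_{\theta_i} \mu_i(o_i)\,
           \nabla_{a_i} Q_i^{\mu}(\mathbf{x}, a_1, \dots, a_N)
           \big|_{a_i = \mu_i(o_i)} \right]

\mathcal{L}(\phi_i)
  = \mathbb{E}_{\mathbf{x}, a, r, \mathbf{x}' \sim \mathcal{D}}
    \left[ \left( Q_i^{\mu}(\mathbf{x}, a_1, \dots, a_N) - y_i \right)^2 \right],
\qquad
y_i = r_i + \gamma\, Q_i^{\mu'}(\mathbf{x}', a_1', \dots, a_N')
      \big|_{a_j' = \mu_j'(o_j')}
```

In the joint-decoupled setting of this work, $N = 6$, one agent per joint of the UR5 arm.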
Analysis of the joint-decoupled multi-agent formulation
In this work, each joint of the robotic arm is modeled as an independent agent. This section provides a formulation and analysis of how this decoupling affects coordination and learning.
Following the standard factorization framework in MADRL,35 we model the joint policy of the 6-DoF manipulator as the product of individual policies for each joint agent:
Following the sequential structure insight proposed by Ramesh et al.,36 multi-link manipulators possess an inherently decoupled nature, where the state and action spaces grow linearly with the number of links. Based on this, we formulate the joint action space as the Cartesian product of each joint’s action space:
During centralized training, the Critic network of each agent receives the actions of all joints along with the global observation:37
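The decoupled formulation in this section can be sketched as follows, assuming deterministic per-joint policies as in MADDPG. The class `JointActor`, the helper names, and the observation size `OBS_DIM` are hypothetical, and the linear-plus-tanh policy is only a placeholder for the actor networks. The sketch shows two structural points: each agent outputs a one-dimensional action, so the full command is the concatenation of six scalars, and each centralized critic receives the global observation together with the actions of all joints.

```python
import numpy as np

N_JOINTS = 6    # one agent per UR5 joint
OBS_DIM = 10    # hypothetical observation size


class JointActor:
    """Placeholder deterministic policy for one joint: scalar action in [-1, 1]."""

    def __init__(self, obs_dim, rng):
        self.w = rng.normal(scale=0.1, size=obs_dim)

    def act(self, obs):
        return np.tanh(self.w @ obs)


def joint_action(actors, observations):
    """Decentralized execution: each agent acts on its own observation only."""
    return np.array([actor.act(obs) for actor, obs in zip(actors, observations)])


def critic_input(global_obs, actions):
    """Centralized training: a critic sees the global observation and all joint actions."""
    return np.concatenate([global_obs, actions])


rng = np.random.default_rng(0)
actors = [JointActor(OBS_DIM, rng) for _ in range(N_JOINTS)]
global_obs = rng.normal(size=OBS_DIM)
# Assumption for this sketch: every joint agent observes the full global state.
actions = joint_action(actors, [global_obs] * N_JOINTS)
q_input = critic_input(global_obs, actions)
```

Each actor explores a one-dimensional action space rather than a six-dimensional one, while coordination between joints is handled by the critics during training.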
Experimental design and analysis
Simulation parameters
To evaluate the performance of our proposed method, systematic experiments were carried out in a PyBullet-based simulation environment serving as the training platform for robotic manipulation. The experimental setup consisted of multiple UR5 robotic arms and target objects modeled as cylinders with dimensions of 5 cm in diameter and 3 cm in height, each centered at their geometric centroid. To prevent interference with the robotic arm base positioned at the origin, target objects were placed within the coordinate range of [20, 80] along both x and y axes. In the PyBullet environment, the robotic arm model is described using a URDF file, which includes a
List of primary simulation parameters.
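The randomized target placement described above can be sketched as follows. The text gives the range [20, 80] without units; centimeters are assumed here because the cylinder dimensions are stated in centimeters. The function name `sample_target`, the uniform distribution, and the assumption that the cylinder rests on the ground plane (so its centroid sits at half its height) are illustrative choices, not details confirmed by the source.

```python
import numpy as np


def sample_target(rng, low=20.0, high=80.0, diameter=5.0, height=3.0):
    """Draw a random cylinder placement (units assumed to be centimeters)."""
    x, y = rng.uniform(low, high, size=2)
    return {
        "x": float(x),
        "y": float(y),
        "z": height / 2.0,       # centroid height if the cylinder rests on the ground
        "diameter": diameter,
        "height": height,
    }


rng = np.random.default_rng(0)
target = sample_target(rng)      # a new target would be drawn at each episode reset
```

Keeping both coordinates at or above 20 keeps the target away from the arm base at the origin, which matches the stated purpose of the range.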
CD-MADDPG experimental analysis
In this section, an experiment employing a single robotic arm and a target object is conducted. The results demonstrate the performance of algorithms in terms of success rate, reward, angle, and distance.
Figure 5 illustrates the convergence characteristics of the success rates for the evaluated algorithms. It can be observed that CD-MADDPG achieves the best overall performance: its success rate peaks at approximately 13,000 episodes, then fluctuates at a high level between 80% and 100%, and ultimately converges at

Convergence of reward for various algorithms.
Figure 7 illustrates the convergence trends of the minimum distance for the evaluated algorithms during single-arm manipulation tasks, a metric highly correlated with the convergence characteristics of the cumulative reward. In terms of the initial descent rate, CD-MADDPG significantly outperforms the other three algorithms, demonstrating superior learning efficiency, followed by DDPG-Single, then MADDPG, while DDPG exhibits the least favorable performance. It can be observed that DDPG fails to learn effectively on the distance metric, converging to

Convergence of minimum distance for various algorithms.

Convergence of the angular error for various algorithms.

Convergence of height for various algorithms.
Conclusion
This paper addresses the positioning problem of a joint-decoupled robotic arm performing suction-based grasping and transportation tasks in environments with randomly generated target positions. By integrating a curiosity-driven experience replay mechanism, we propose the CD-MADDPG algorithm, which uses the prediction residuals of a forward dynamics model on state transitions to evaluate the exploration value of samples, and decouples the robotic arm into per-joint agents to reduce the dimensionality of the action space. Simulation results demonstrate that CD-MADDPG outperforms baseline algorithms, achieving a 19% improvement in success rate and a 44% improvement in distance accuracy. Future work will focus on deploying the proposed algorithm on physical robotic platforms and on investigating its robustness through more comprehensive experiments in complex environments, including additional influencing factors such as dynamic obstacles.
Footnotes
Author contributions
Conceptualization, Y.A. and W.W.; methodology, Y.A. and Z.S.; software, Z.L. and Y.A.; validation, Z.L., W.W. and Y.L.; formal analysis, W.W. and Z.S.; investigation, Y.A. and Z.L.; resources, Y.L. and Z.S.; data curation, Y.A. and Z.L.; writing—original draft preparation, Y.A. and W.W.; writing—review and editing, Y.L. and Z.S.; visualization, Z.L.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. and Z.S. All authors have read and agreed to the published version of the manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by the Natural Science Foundation of Fujian Province, China (Grant No. 2021J011112, 2023J011011); the Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) (Grant No. MJUKF-IPIC202406); the Science and Technology Bureau of Putian, Fujian Province, China (Grant No. 2022GZ2001ptxy11, 2024GZ3001PTXY19); the Startup Fund for Advanced Talents of Putian University (Grant No. 2024039); and the Postgraduate Research Project of Putian University (Grant No. yjs2024033).
Declaration of conflicting interests
The authors declare that there is no conflict of interest with respect to the research, authorship, and/or publication of this article.
