Collaborative control of robotic arm suction via joint-decoupled multi-agent reinforcement learning

Abstract

With the advancement of intelligent manufacturing technology, there is an urgent demand for robots to replace humans in high-precision, heavy-duty, and other transportation tasks. In contrast to large-scale robotic arm manipulation, this work focuses on achieving suction-based grasping and transportation using an unmanned ground vehicle equipped with a small robotic arm. Compared with large manipulators, small robotic arms require higher precision and have lower fault tolerance. To improve exploration efficiency and sample utilization, this work proposes an enhanced multi-agent deep reinforcement learning method called curiosity-driven Multi-Agent Deep Deterministic Policy Gradient (CD-MADDPG). This method integrates a curiosity-driven prioritized experience replay mechanism into the multi-agent deep deterministic policy gradient framework, where the prediction residual of a forward dynamics model is used to quantify the novelty of samples, guiding the agent toward underexplored states. In addition, a decoupling strategy is adopted, where each joint of a single robotic arm is treated as an independent agent, thereby transforming the high-dimensional action space into a low-dimensional one, complemented by a designed global reward mechanism. Experimental results demonstrate that CD-MADDPG achieves a 19% improvement in success rate and a 44% improvement in distance accuracy compared to the non-decoupled single-agent counterpart, effectively accomplishing the suction-based grasping and transportation task.

Keywords

Robot arm, multi-agent deep deterministic policy gradient, curiosity-driven replay, deep reinforcement learning, multi-agent cooperation

Introduction

With the advent of Industry 4.0, intelligent manufacturing has emerged as a cornerstone of modern industrial transformation, driving the integration of advanced robotics to enhance flexibility, efficiency, and adaptability in production processes.^1–3 Robotic manipulators, particularly those equipped with suction end-effectors, have become indispensable for tasks such as assembly, grasping, and material handling in unstructured environments. Suction-based operations offer advantages in handling delicate or irregular objects without mechanical damage, yet they demand precise real-time control of end-effector pose, including position, orientation, and approach height. As manufacturing shifts toward dynamic, cluttered workspaces, the demand for collaborative, adaptive control strategies has intensified, positioning multi-agent deep reinforcement learning (MADRL) as a promising paradigm for achieving robust robotic suction performance.

However, robotic suction grasping in unstructured and dynamic settings presents formidable challenges. The UR5 manipulator’s 6-degree-of-freedom (6-DoF) configuration results in a high-dimensional continuous action space, exacerbated by kinematic coupling among joints, which complicates exploration and leads to sample inefficiency. Traditional model-based controllers (e.g., model predictive control or impedance control) struggle with non-stationary targets and sparse rewards, while conventional deep reinforcement learning (DRL) approaches often fail to maintain stability under varying object positions and orientations.^4,5 Moreover, single-agent DRL cannot effectively coordinate multiple joints for precise multi-metric alignment (horizontal distance, vertical height, and angular deviation), resulting in low success rates and slow convergence in sparse-reward environments.

Existing research has explored various solutions to robotic manipulation control. Trajectory planning techniques, such as hybrid A* and particle swarm optimization or free-space decomposition methods, have improved path smoothness but exhibit limited adaptability to dynamic obstacles and real-time pose requirements.^6–8 Single-agent DRL algorithms, including DDPG, PPO, and TD3, have demonstrated success in simplified pick-and-place tasks yet suffer from the curse of dimensionality and poor exploration in high-DoF systems.^9–11 Recent advances in multi-agent DRL have addressed collaboration by treating dual arms or finger groups as independent agents, achieving better coordination in assembly and grasping.^12–15 Nevertheless, these methods typically rely on uniform experience replay, which neglects the novelty of under-explored states, and lack tailored reward designs for suction-specific metrics, leading to suboptimal precision and slow convergence in cluttered suction scenarios.

To overcome these limitations, this study proposes curiosity-driven multi-agent deep deterministic policy gradient (CD-MADDPG), an enhanced multi-agent deep deterministic policy gradient (MADDPG) framework that integrates a curiosity-driven prioritized experience replay (CD-PER) mechanism with joint decoupling. Each of the six joints of the UR5 arm is modeled as an independent agent, transforming the high-dimensional action space into six low-dimensional subspaces while preserving global coordination through centralized training and decentralized execution. A forward dynamics model quantifies sample novelty via prediction residuals, enabling prioritized replay of high-novelty experiences for efficient exploration.¹⁶ Furthermore, a composite reward function balances horizontal proximity, angular alignment, and height clearance, ensuring suction-specific precision. This approach not only mitigates non-stationarity in multi-agent environments but also significantly improves sample utilization and convergence speed.

This study investigates robotic suction grasping in dynamic environments. A robotic arm equipped with a suction end-effector serves as the actuator, where each joint is modeled as an individual agent through a decoupling control strategy. During the learning process, critical factors including end-effector orientation, distance to target, and approach height are incorporated. A MADRL framework is constructed to facilitate the training of the robotic arm. The main contributions are as follows.

(1) A simulated environment for robotic suction is developed by decoupling joints as independent agents and incorporating randomized target positioning.

(2) The CD-MADDPG algorithm is proposed, integrating curiosity-driven PER to optimize experience prioritization and enhance noise robustness.

(3) A multi-metric reward function is introduced, balancing joint angles, target proximity, and height requirements.

The remainder of this paper is organized as follows. The “Related work” section reviews related work on DRL-based robotic control. The “System model” section establishes the UR5 kinematic model and experimental setup. The “CD-MADDPG-based control strategy for robotic manipulator” section details the CD-MADDPG algorithm, including Markov decision formulation, curiosity-driven replay, and reward design. The “Experimental design and analysis” section presents comprehensive simulation results and analysis. Finally, the “Conclusion” section concludes the paper and outlines future directions.

Related work

For the problem of trajectory planning and control of robotic manipulators in structured environments, previous studies have primarily relied on model-based approaches. Spahn et al. proposed a model predictive control framework combined with free-space decomposition to optimize coupled trajectories for mobile manipulators. Sadiq et al. integrated A* search with particle swarm optimization to generate smooth trajectories. Recent works have further adopted DMP-PSO frameworks to enhance joint-angle optimization and time-optimal trajectory planning.^17,18 However, these model-based methods depend heavily on accurate environmental models and suffer from high computational costs and limited real-time adaptability when target positions and orientations change unpredictably in unstructured scenes. In contrast, the proposed CD-MADDPG employs a model-free MADRL paradigm that eliminates the need for explicit environment modeling, thereby enabling faster adaptation to dynamic suction tasks.

For the challenge of high-dimensional action spaces and sparse rewards in single-agent DRL for robotic grasping, existing methods have achieved notable progress in simplified pick-and-place tasks. Liu et al., Wang et al. (2026), and Hazem et al. applied DDPG, TD3, PPO, and SAC to learn end-to-end policies directly from raw states or image pixels, demonstrating strong sim-to-real transfer capabilities.^19–21 To address the complexity of high-DoF systems, Tutsoy et al.²² proposed reinforcement learning combined with symbolic inverse kinematics to generate reduced-order control signals for humanoid robots such as the NAO.

Hou et al.²³ further divided the robotic arm into multiple control segments and experimentally demonstrated that grasping performance improves significantly when the number of control segments exceeds four. Therefore, the proposed method decouples the six joints of the UR5 arm into independent agents, substantially reducing the action-space dimensionality while preserving global coordination and achieving faster convergence and higher precision than single-agent baselines.

For the coordination problem in multi-robot or multi-joint manipulation, multi-agent DRL frameworks have been introduced to enhance stability. Chen et al.,²⁴ Low et al.,²⁵ Aina et al., and Waseem et al.²⁶ treated dual arms or finger groups as independent agents under the centralized-training decentralized-execution paradigm, achieving improved performance in assembly and dexterous grasping tasks.

Although these studies mitigate non-stationarity, they still rely on uniform experience replay and lack suction-specific reward designs. In comparison, the proposed CD-MADDPG not only adopts joint decoupling but also integrates a CD-PER mechanism that prioritizes novel samples, thereby significantly enhancing exploration efficiency in sparse-reward suction scenarios—an advantage not found in existing multi-agent approaches.

For the critical issue of exploration inefficiency in model-free reinforcement learning, curiosity-driven mechanisms and prioritized experience replay (PER) have been widely investigated. Lanier et al. and Dong et al.²⁷ utilized forward dynamics models to quantify prediction residuals and prioritize novel transitions. Xue et al.²⁸ and Farooq et al. applied curiosity-enhanced methods to accelerate learning in manipulation tasks.²⁹ Tutsoy et al.³⁰ further proposed an exploration-exploitation-based adaptive law for intelligent model-free control, validated through real-time experiments. Despite these advances, few studies combine curiosity-driven prioritization with multi-agent joint decoupling, especially for suction grasping that requires simultaneous optimization of distance, height, and orientation. The proposed CD-MADDPG fills this gap by directly integrating curiosity-driven PER into the MADDPG framework, achieving higher sample utilization and faster convergence than prior curiosity-only or multi-agent-only methods.

In summary, although previous works have individually addressed trajectory planning, single-agent DRL, multi-agent coordination, and exploration mechanisms, critical gaps remain in handling high-dimensional joint coupling, multi-metric suction requirements, and efficient exploration in unstructured environments.^31,32 The proposed CD-MADDPG overcomes these limitations through joint decoupling, curiosity-driven prioritized replay, and a composite reward function, achieving a success rate of 96% (19% higher than baseline algorithms) and converging to a minimum distance of 1.45 cm (44% better than baseline algorithms). It demonstrates clear advantages in precision, convergence speed, and robustness over all compared methods.

System model

Scenario and model

The UR5 robotic arm (Figure 1) possesses a serial linkage mechanism, with its functionality divided between an arm segment (joints 1–3) and a wrist segment (joints 4–6), which collectively enable versatile end-effector positioning.³³ Its kinematic model is governed by the Denavit–Hartenberg (D-H) convention, which defines four parameters per link: the link length ( $a_{i}$ ) as the common normal distance; the link offset ( $d_{i}$ ) as the translational displacement along the prior $z$ -axis; the link twist angle ( $α_{i}$ ) between consecutive $z$ -axes; and the joint angle ( $θ_{i}$ ) of revolution. The sign of these parameters, evidenced by negative values for $a_{2}$ and $a_{3}$ , is inherently determined during the coordinate system assignment phase, signaling an orientation opposite to the referenced axis. The complete set of these values, as provided, fully encapsulates the spatial relationships between all links of the UR5, thereby forming the basis for deriving its forward kinematics.

Figure 1.

Structure diagram of UR5 manipulator.

Forward kinematics of manipulator

According to the standard D-H convention, the homogeneous transformation matrix of the $i$ th link frame relative to the previous link frame is given by:³⁴

{}^{i-1}_{i}T = \mathrm{Rot}_{z} (θ_{i}) \, \mathrm{Trans}_{z} (d_{i}) \, \mathrm{Rot}_{x} (α_{i}) \, \mathrm{Trans}_{x} (a_{i}),

(1)

Therefore, it can be expressed as:

{}^{i-1}_{i}T = \begin{bmatrix} \cos (θ_{i}) & - \sin (θ_{i}) \cos (α_{i}) & \sin (θ_{i}) \sin (α_{i}) & a_{i} \cos (θ_{i}) \\ \sin (θ_{i}) & \cos (θ_{i}) \cos (α_{i}) & - \cos (θ_{i}) \sin (α_{i}) & a_{i} \sin (θ_{i}) \\ 0 & \sin (α_{i}) & \cos (α_{i}) & d_{i} \\ 0 & 0 & 0 & 1 \end{bmatrix},

(2)

The homogeneous transformation matrix of the robotic manipulator’s end-effector frame (frame 6) with respect to the Cartesian base frame (frame 0) can be expressed as:

{}^{0}_{6}T = {}^{0}_{1}T \cdots {}^{5}_{6}T = \begin{bmatrix} R & P \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} n_{x} & o_{x} & a_{x} & p_{x} \\ n_{y} & o_{y} & a_{y} & p_{y} \\ n_{z} & o_{z} & a_{z} & p_{z} \\ 0 & 0 & 0 & 1 \end{bmatrix}

(3)

From the properties of the homogeneous transformation matrix of the robotic arm, it is known that the vectors $[n_{x}, n_{y}, n_{z}]$ , $[o_{x}, o_{y}, o_{z}]$ , and $[a_{x}, a_{y}, a_{z}]$ represent the directional vectors of the end-effector’s Cartesian coordinate frame. As shown in Figure 2, the orientation of the end-effector $z_{6}$ is consistent with the direction of the $z$ -axis; hence, the $z$ -axis orientation represented by $[a_{x}, a_{y}, a_{z}]$ corresponds to the end-effector’s approach direction. Meanwhile, $[p_{x}, p_{y}, p_{z}]$ represents the three-dimensional (3D) spatial coordinates of the end-effector. These can be expressed respectively as:

\begin{aligned} p_{x} &= a_{2} c_{1} c_{2} + d_{4} s_{1} + d_{6} (c_{5} s_{1} - c_{234} c_{1} s_{5}) + d_{5} s_{234} c_{1} + a_{3} c_{1} c_{2} c_{3} - a_{3} c_{1} s_{2} s_{3} \\ p_{y} &= a_{2} c_{2} s_{1} - d_{4} c_{1} - d_{6} (c_{1} c_{5} + c_{234} s_{1} s_{5}) + d_{5} s_{234} s_{1} + a_{3} c_{2} c_{3} s_{1} - a_{3} s_{1} s_{2} s_{3} \\ p_{z} &= d_{1} + d_{5} (s_{23} s_{4} - c_{23} c_{4}) + a_{3} s_{23} + a_{2} s_{2} - d_{6} s_{5} (c_{23} s_{4} + s_{23} c_{4}) \end{aligned}

(4)

Figure 2.

Denavit–Hartenberg parameter coordinate frame diagram.

Among these, the commonly used abbreviations in robotics are adopted: $c_{i} = \cos θ_{i}$ , $s_{i} = \sin θ_{i}$ , $c_{i j} = \cos (θ_{i} + θ_{j})$ , $s_{i j} = \sin (θ_{i} + θ_{j})$ , $c_{i j k} = \cos (θ_{i} + θ_{j} + θ_{k})$ , and so forth. $a_{i}$ and $d_{i}$ are the D-H parameters, the specific values of which are shown in Figure 3.
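As a rough numerical check of the kinematic chain, the sketch below multiplies the six link transforms of equation (2) as in equation (3) and compares the resulting position with a closed form multiplied out by hand. The D-H values are the commonly quoted UR5 parameters (in metres and radians) and are an assumption here, since the paper's own values appear only in a figure; the function names are illustrative.

```python
import math

# Assumed UR5 D-H parameters (metres, radians); illustrative, not taken from the paper.
A = [0.0, -0.425, -0.39225, 0.0, 0.0, 0.0]            # link lengths a_i
D = [0.089159, 0.0, 0.0, 0.10915, 0.09465, 0.0823]     # link offsets d_i
ALPHA = [math.pi / 2, 0.0, 0.0, math.pi / 2, -math.pi / 2, 0.0]  # twists alpha_i


def dh_matrix(theta, d, a, alpha):
    """Homogeneous transform of one link, written out as in equation (2)."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [
        [ct, -st * ca, st * sa, a * ct],
        [st, ct * ca, -ct * sa, a * st],
        [0.0, sa, ca, d],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul(m1, m2):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(m1[r][k] * m2[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def forward_kinematics(thetas):
    """Chain the six link transforms as in equation (3)."""
    t = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
    for theta, d, a, alpha in zip(thetas, D, A, ALPHA):
        t = matmul(t, dh_matrix(theta, d, a, alpha))
    return t


def closed_form_position(thetas):
    """End-effector position multiplied out by hand for the assumed twists."""
    t1, t2, t3, t4, t5, _ = thetas
    c1, s1 = math.cos(t1), math.sin(t1)
    c2, s2 = math.cos(t2), math.sin(t2)
    c5, s5 = math.cos(t5), math.sin(t5)
    c23, s23 = math.cos(t2 + t3), math.sin(t2 + t3)
    c234, s234 = math.cos(t2 + t3 + t4), math.sin(t2 + t3 + t4)
    a2, a3 = A[1], A[2]
    d1, d4, d5, d6 = D[0], D[3], D[4], D[5]
    reach = a2 * c2 + a3 * c23 + d5 * s234 - d6 * c234 * s5  # in-plane reach
    side = d4 + d6 * c5                                      # lateral offset
    px = c1 * reach + s1 * side
    py = s1 * reach - c1 * side
    pz = d1 + a2 * s2 + a3 * s23 - d5 * c234 - d6 * s234 * s5
    return px, py, pz
```

Writing the position as a rotation about the base axis by $θ_{1}$ of an in-plane reach and a lateral offset makes the symmetry between $p_{x}$ and $p_{y}$ explicit, which is a convenient way to spot sign or index slips in the expanded expressions.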

Figure 3.

Experimental scene diagram of a single robotic arm.

Experimental setup and evaluation metrics

As shown in Figures 3 and 4, the simulation environment consists of a UR5 robotic arm and a pink cylinder representing the object to be suctioned. Task completion is evaluated based on three criteria: horizontal distance, height, and angular deviation. The horizontal distance refers to the distance in the $X$ – $Y$ plane between the center point of the suction cup and the center point of the target object. As shown in the top-view Figure 5, let $p_{m} = (x_{m}, y_{m})$ denote the center point of the suction cup, and let $p_{g} = (x_{g}, y_{g})$ denote the center of mass of the target object. The horizontal distance $L_{s}$ can be expressed as:

L_{s} = \sqrt{(x_{m} - x_{g})^{2} + (y_{m} - y_{g})^{2}}

(5)

Figure 4.

The manipulator in action, shown from a front-facing viewpoint.

Figure 5.

Convergence of success rates for various algorithms.

When $L_{s} \leq {\bar{ζ}}_{d}$ , the current action can be considered as meeting the criterion for horizontal distance, where ${\bar{ζ}}_{d}$ represents the allowable error tolerance in the horizontal distance metric.

The height metric refers to the shortest distance between the suction cup of the robotic arm’s end-effector and the plane perpendicular to the contact surface of the target object. Let $H_{m}$ denote the $z$ -coordinate of the center point of the suction cup (i.e., the height of the end-effector), and $H_{g}$ represent the $z$ -coordinate of the contact surface of the object. Whether the current action satisfies the height requirement can be expressed as:

η_{h} \leq H_{m} - H_{g} \leq η_{h} + {\bar{ζ}}_{h}

(6)

where $η_{h}$ is the reserved height for the suction operation of the robotic arm, and ${\bar{ζ}}_{h}$ is the allowable error tolerance in the height dimension. The vertical angle $\hat{α}$ refers to the angle between the orientation of the end-effector suction cup and the normal vector of the contact surface of the target object. When $\hat{α} < {\bar{ζ}}_{α}$ , the task is considered successful in terms of angle. The symbols used are summarized in Table 1.
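The three success criteria can be combined into a single check, sketched below. The distance tolerance (2 cm), the angle tolerance (10°), and the reserved height (10 cm) follow values stated elsewhere in the paper; the height tolerance `tol_h` is not given and is an assumed placeholder, as are the function and argument names.

```python
import math


def suction_success(p_m, p_g, h_m, h_g, angle,
                    tol_d=2.0, eta_h=10.0, tol_h=1.0,
                    tol_angle=math.radians(10.0)):
    """Return per-criterion flags and the overall verdict.

    p_m, p_g : (x, y) of the suction-cup centre and target centre, in cm
    h_m, h_g : heights of the suction cup and the contact surface, in cm
    angle    : angular deviation from the surface normal, in radians
    tol_h is an assumed value; the paper does not state it.
    """
    l_s = math.hypot(p_m[0] - p_g[0], p_m[1] - p_g[1])   # equation (5)
    ok_dist = l_s <= tol_d                               # L_s within tolerance
    ok_height = eta_h <= h_m - h_g <= eta_h + tol_h      # equation (6)
    ok_angle = angle < tol_angle                         # angular criterion
    return {"distance": ok_dist, "height": ok_height, "angle": ok_angle,
            "success": ok_dist and ok_height and ok_angle}
```

Returning the individual flags alongside the overall verdict makes it easy to see which criterion caused a failed episode.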

Table 1.

Mathematical parameters.

Symbol	Description
$s_{i}, o_{i}$	State and observation of agent $i$
$a_{i}$	Action of agent $i$
$r_{total}$	Total reward
$γ$	Discount factor
$π_{i}$	Policy of agent $i$
$Q_{i}^{μ}$	Centralized action-value function
$θ_{i}, ψ_{i}$	Actor and Critic parameters
$τ$	Soft update coefficient
$D$	Replay buffer
$f_{ϕ}$	Forward dynamics model
$ϕ$	Dynamics model parameters
${\hat{s}}_{t + 1}$	Predicted next state
$e_{t}$	Prediction error (Curiosity signal)
$p_{i}$	Experience priority
$ϵ$	Small constant offset
$α$	Prioritization hyperparameter
$P (i)$	Sampling probability
$w_{i}$	Importance sampling weights
$β$	Bias correction parameter
$p_{m}, p_{g}$	Suction cup and target coordinates
$L_{s}$	Horizontal distance to target
${\bar{ζ}}_{d}, {\bar{ζ}}_{α}$	Distance and angle tolerances
$H_{m}, H_{g}$	Heights of suction cup and target
$η_{h}, {\bar{ζ}}_{h}$	Target height and its tolerance
$\hat{α}$	Angular deviation from normal

CD-MADDPG-based control strategy for robotic manipulator

Markov decision analysis

In our multi-robot system, each manipulator operates as an autonomous agent within the shared environment. At every time step $t$ , agent $i$ perceives its state $s_{i}$ , executes action $a_{i}$ , and transitions to state $s_{i}^{t + 1}$ while receiving an environmental reward. The formal definitions of the state space $S (t)$ , observation space $O (t)$ , action space $A (t)$ , and reward function $R (t)$ are provided below:

State $s (t)$ : The state information of the manipulator and the target object is defined as follows. The manipulator state $s_{M}$ is defined as:

s_{M} = (x_{M}, y_{M}, z_{M}, θ_{end})

(7)

where $(x_{M}, y_{M}, z_{M})$ are the 3D Cartesian coordinates of the end-effector (in cm), and $θ_{end}$ is the angle between the end-effector axis and the normal vector of the $X$ – $Y$ plane (in radians), ranging over $[0, π / 2]$ , with $0$ indicating a perfectly vertical downward orientation. The target object state $s_{G}$ is defined as:

s_{G} = (x_{G}, y_{G})

(8)

where $(x_{G}, y_{G})$ are the coordinates of the target object in the $X$ – $Y$ plane (in cm).

Observation $o (t)$ : During centralized training, the value network requires the observational information $O (t)$ from all agents, which comprises the full state information. It can be expressed as:

O (t) = \{s_{0}, \dots, s_{M}, s_{g}, \dots, s_{G}\} .

(9)

Action space $a (t)$ : The UR5 robotic arm used in this system model has 6 degrees of freedom. Accordingly, we treat each joint as an intelligent agent, so the action of agent $i$ is:

a_{i} (t) = \{θ_{i}\}, \quad i \in \{1, \dots, 6\},

(10)

where each $θ_{i} \in [0, 1]$ represents the normalized input value for the corresponding joint. Due to kinematic coupling among the joints, varied combinations of these angles enable rich end-effector trajectory and pose adjustments.

To guide the manipulator in completing the suction operation, we design a composite reward function comprising horizontal distance, angular deviation, and height deviation. The specific definitions of each component are as follows.

Horizontal distance reward: Let the coordinates of the end-effector center be $(x_{m}, y_{m})$ and those of the target object center be $(x_{g}, y_{g})$ ; then the horizontal distance is $L_{s} = \sqrt{(x_{m} - x_{g})^{2} + (y_{m} - y_{g})^{2}}$ . The distance reward is defined as:

r_{dist} = \begin{cases} R_{max} + 10, & 0 \leq L_{s} \leq L_{succ} \\ R_{max} - L_{s}, & L_{succ} < L_{s} \leq L_{bound} \\ - (L_{s} + L_{bound}), & L_{s} > L_{bound} \end{cases}

(11)

where $R_{max} = 30$ is the maximum positive reward, $L_{bound} = 30$ cm is the virtual boundary length, and $L_{succ} = 2$ cm is the success distance threshold. When the end-effector enters the success region, the maximum reward is given; within the boundary, the reward decreases linearly with distance; beyond the boundary, a negative penalty is imposed.

Angular deviation reward: Let the angle between the end-effector axis and the normal vector of the target surface be $α$ (in radians). The angular reward is defined as:

r_{angle} = \begin{cases} R_{max} \cdot \frac{α}{π}, & α \geq 0 \\ 0, & α < 0 \end{cases}

(12)

This design encourages the end-effector to point downward ( $α > 0$ indicates a downward tilt), with a smaller angle yielding a higher reward. The success angular threshold is $α_{succ} = 10^{\circ}$ .

Height deviation penalty: Let the height of the end-effector be $z_{m}$ and the target surface height be $h_{target}$ (in this paper, $h_{target} = η_{h} =$ 10 cm). The height penalty is defined as:

r_{height} = - \sqrt{(z_{m} - h_{target})^{2}}

(13)

This penalty term encourages the end-effector to remain within a height range near the target surface.

Total reward: The total reward is obtained as the weighted sum of the three terms:

r_{total} = r_{dist} + r_{angle} + λ \cdot r_{height}

(14)

where $λ = 0.01$ is the weight coefficient for the height penalty. This value is chosen from the relative magnitudes of the terms: $r_{dist}$ and $r_{angle}$ are of magnitude approximately $30$ , while $r_{height}$ lies in the range $0$ – $10$ ; multiplying by $0.01$ keeps the height contribution small relative to the other two terms, preventing the height term from dominating the optimization.
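A minimal sketch of the composite reward, transcribing equations (11) to (14) with the constants quoted in the text. The function names are illustrative, and the angular term is implemented exactly as equation (12) is written.

```python
import math

R_MAX = 30.0      # maximum positive reward
L_BOUND = 30.0    # virtual boundary length, cm
L_SUCC = 2.0      # success distance threshold, cm
H_TARGET = 10.0   # target height (eta_h), cm
LAMBDA = 0.01     # weight of the height penalty


def distance_reward(l_s):
    """Piecewise distance reward of equation (11); l_s in cm, l_s >= 0."""
    if l_s <= L_SUCC:
        return R_MAX + 10.0
    if l_s <= L_BOUND:
        return R_MAX - l_s
    return -(l_s + L_BOUND)


def angle_reward(alpha):
    """Angular reward of equation (12) as written; alpha in radians."""
    return R_MAX * alpha / math.pi if alpha >= 0.0 else 0.0


def height_penalty(z_m):
    """Height penalty of equation (13); the root of a square is an absolute value."""
    return -abs(z_m - H_TARGET)


def total_reward(l_s, alpha, z_m):
    """Weighted sum of equation (14)."""
    return distance_reward(l_s) + angle_reward(alpha) + LAMBDA * height_penalty(z_m)
```

For example, an end-effector 1 cm from the target in the plane, with zero angle and 2 cm above the target height, receives $40 + 0 - 0.01 \cdot 2 = 39.98$ .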

Curiosity-driven experience replay mechanism

Standard MADDPG employs a uniform sampling strategy that overlooks the non-uniform distribution of learning value among experiences. While TD-error-based PER mitigates this, it often suffers from insufficient exploration in sparse-reward environments. To address this issue, we introduce a CD-PER mechanism¹⁶ and integrate it with MADDPG. By leveraging the prediction residual of a forward dynamics model to quantify sample novelty, CD-PER assigns higher replay weights to transitions with larger errors, which indicate under-fitted dynamical features. This shift from value-based to cognition-based prioritization significantly enhances sample efficiency in complex tasks.

Forward dynamics model: The forward dynamics model is implemented as a neural network $f_{ϕ}$ , parameterized by $ϕ$ . It takes the current state $s_{t}$ and the executed action $a_{t}$ as inputs to predict the subsequent state ${\hat{s}}_{t + 1}$ :

{\hat{s}}_{t + 1} = f_{ϕ} (s_{t}, a_{t})

(15)

The model undergoes self-supervised learning by minimizing the mean squared error between the predicted state and the ground-truth observation $s_{t + 1}$ . The corresponding loss function is defined as:

L_{forward} (ϕ) = \frac{1}{2} ‖ {\hat{s}}_{t + 1} - s_{t + 1} ‖_{2}^{2}

(16)

This prediction error, denoted as $e_{t}$ , represents the curiosity signal of the agent:

e_{t} = ‖ f_{ϕ} (s_{t}, a_{t}) - s_{t + 1} ‖_{2}^{2} .

(17)

Priority metric criteria: During the experience replay process, the priority $p_{i}$ of the $i$ th transition is determined by the intensity of its corresponding curiosity signal:

p_{i} = (e_{i} + ϵ)^{α},

(18)

where $ϵ$ is a small positive constant that ensures transitions with zero prediction error still have a non-zero probability of being sampled. The hyperparameter $α \in [0, 1]$ controls the degree of prioritization, interpolating between uniform sampling ( $α = 0$ ) and fully prioritized sampling ( $α = 1$ ).

Sampling probability and bias correction: The probability $P (i)$ of selecting transition $i$ is its priority divided by the sum of the priorities of all transitions in the buffer:

P (i) = \frac{p_{i}}{\sum_{k} p_{k}}

(19)

Since non-uniform sampling introduces estimation bias by altering the data distribution, importance sampling (IS) weights $w_{i}$ are utilized to correct the gradient updates:

w_{i} = {(\frac{1}{N \cdot P (i)})}^{β}

(20)

where $N$ denotes the current size of the replay buffer, and $β$ is a hyperparameter that is linearly annealed from its initial value to 1 over the course of training.
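The priority, sampling, and bias-correction steps of equations (18) to (20) can be sketched in a few lines. The default values of `eps`, `alpha`, and `beta`, the optional normalisation by the largest weight (a common stabilisation trick, not stated in the paper), and the flat-list buffer are all assumptions for illustration; the paper's own buffer uses a SumTree.

```python
import random


def priorities(errors, eps=1e-6, alpha=0.6):
    """Equation (18): p_i = (e_i + eps) ** alpha."""
    return [(e + eps) ** alpha for e in errors]


def sampling_probabilities(prios):
    """Equation (19): normalise the priorities over the buffer."""
    total = sum(prios)
    return [p / total for p in prios]


def importance_weights(probs, beta=0.4, normalise=True):
    """Equation (20): w_i = (1 / (N * P(i))) ** beta."""
    n = len(probs)
    weights = [(1.0 / (n * p)) ** beta for p in probs]
    if normalise:  # assumed stabilisation: scale so the largest weight is 1
        w_max = max(weights)
        weights = [w / w_max for w in weights]
    return weights


def sample_batch(probs, k, rng=random):
    """Draw k buffer indices with replacement according to P(i)."""
    return rng.choices(range(len(probs)), weights=probs, k=k)
```

With `alpha=0` every priority equals 1, so sampling is uniform and all unnormalised weights equal 1; larger `alpha` shifts sampling toward transitions with larger prediction errors, and the weights shrink the updates from those over-sampled transitions.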

DRL algorithm

This work introduces a MADDPG algorithm employing the Centralized Training with Decentralized Execution (CTDE) paradigm, designed to balance system-wide coordination with individual agent performance in multi-robot arm systems.

The policy gradient for each agent is formulated as:

\nabla J (θ_{i}) = E_{s, a \sim D} [\nabla_{θ_{i}} π_{i} (a_{i} ∣ o_{i}) Q_{i}^{μ} (o, a) |_{a_{i} = π_{i} (o_{i})}]

(21)

where $D$ represents the experience replay buffer, $o = \{o_{1}, \dots, o_{M}\}$ denotes the global observation, and $Q_{i}^{μ} (\cdot)$ is a centralized action-value function that estimates agent $i$ ’s $Q$ -value using all agents’ states and actions. The action $a_{i} = π_{i} (o_{i})$ is generated by agent $i$ ’s policy network $π_{i}$ . Each agent’s value network is optimized via the loss function:

L (ψ_{i}) = E_{s, a, r, s^{'}} [{(Q_{i}^{μ} (o, a_{1}, \dots, a_{M}) - \bar{y})}^{2}]

(22)

with target value $\bar{y}$ defined as:

\bar{y} = r_{i} + γ Q_{i}^{μ^{'}} (s^{'}, a_{1}^{'}, \dots, a_{M}^{'}) |_{a_{j}^{'} = π_{j}^{'} (s_{j}^{'})}

(23)

Here, $π^{'} = (π_{1}^{'}, \dots, π_{M}^{'})$ represents the target policy networks, and $r_{i}$ is the reward obtained by agent $i$ . The complete training procedure is outlined in Algorithm 1.

Algorithm 1.

CD-PER-MADDPG Training Procedure.

Require: $N$ agents, batch size $K$ , parameters $α, β$

1: Init $μ_{θ_{i}}, Q_{ψ_{i}}, f_{ϕ_{i}}$ and target nets for all agents $i$
2: Init prioritized buffer $D$ (SumTree)
3: For episode $= 1$ to $M$
4:   Observe initial state $s$
5:   For $t = 1$ to $T$
6:     Execute $a$ , observe $r$ and next state $s^{'}$
7:     Compute residual: $e_{i} = ‖ f_{ϕ_{i}} (s_{i}, a_{i}) - s_{i}^{'} ‖^{2}$
8:     Set initial priority: $p_{t} = (\sum_{i = 1}^{N} e_{i} + ϵ)^{α}$
9:     Store $(s, a, r, s^{'}, done)$ with $p_{t}$ in $D$
10:     If updating conditions are met then
11:       Sample $K$ samples with $P (j) = p_{j} / \sum p$
12:       Compute IS weights: $w_{j} = (| D | \cdot P (j))^{- β}$
13:       For each agent $i = 1$ to $N$
14:         Update $ϕ_{i}$ by minimizing $L (ϕ_{i})$
15:         Update $ψ_{i}$ by minimizing $L (ψ_{i})$
16:         Update $θ_{i}$ via policy gradient
17:       End For
18:       Update $p_{j}$ in $D$ using new residuals/TD-errors
19:       Soft update: $θ_{i}^{'} \leftarrow τ θ_{i} + (1 - τ) θ_{i}^{'}, \dots$
20:     End If
21:     $s \leftarrow s^{'}$
22:   End For
23: End For
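Three scalar building blocks of the training procedure are small enough to sketch directly: the critic target of equation (23), the soft target-network update of line 19, and the linear annealing of $β$ . Parameters are represented as flat lists of floats, and the values of `gamma`, `tau`, and `beta0`, as well as dropping the bootstrap term when `done` is set, are assumptions rather than settings reported in the paper.

```python
def td_target(reward, next_q, gamma=0.95, done=False):
    """Equation (23): y = r + gamma * Q'(s', a'); no bootstrap at episode end (assumed)."""
    return reward + (0.0 if done else gamma * next_q)


def soft_update(target_params, online_params, tau=0.01):
    """Line 19 of Algorithm 1: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]


def annealed_beta(step, total_steps, beta0=0.4):
    """Linear schedule that moves beta from beta0 to 1 over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return beta0 + (1.0 - beta0) * frac
```

A small `tau` makes the target networks trail the online networks slowly, which keeps the regression target in equation (23) from moving too quickly between updates.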

Analysis of the joint-decoupled multi-agent formulation

In this work, each joint of the robotic arm is modeled as an independent agent. This section provides a formulation and analysis of how this decoupling affects coordination and learning.

Following the standard factorization framework in MADRL,³⁵ we model the joint policy of the 6-DOF manipulator as the product of individual policies for each joint agent:

π (a | s) = \prod_{i = 1}^{6} π_{i} (a_{i} | o_{i})

(24)

where $a = [a_{1}, a_{2}, \dots, a_{6}]$ denotes the joint action, $a_{i} = θ_{i}$ is the angle command for the $i$ th joint, and $o_{i}$ is the local observation of agent $i$ .

Following the sequential structure insight proposed by Ramesh et al.,³⁶ multi-link manipulators possess an inherently decoupled nature, where the state and action spaces grow linearly with the number of links. Based on this, we formulate the joint action space as the Cartesian product of each joint’s action space:

A = A_{1} \times A_{2} \times \dots \times A_{6}

(25)

where $A_{i} = [0, 1]$ is the normalized angle space for the $i$ th joint, and each subspace is governed by an independent agent.

During centralized training, the Critic network of each agent receives the actions of all joints along with the global observation:³⁷

Q_{i}^{μ} (o, a_{1}, a_{2}, \dots, a_{6}) = E [\sum_{t = 0}^{\infty} γ^{t} r_{i}^{t}]

(26)

This design enables the Critic to explicitly model the coupling relationships among joints, thereby ensuring the coordination of decoupled actions. The policy gradient for each agent is computed via the centralized Critic, guaranteeing global optimization of the decoupled actions:³⁸

\nabla J (θ_{i}) = E_{s, a \sim D} [\nabla_{θ_{i}} π_{i} (a_{i} | o_{i}) \cdot Q_{i}^{μ} (o, a_{1}, \dots, a_{6})]

(27)
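The split between decentralized execution and centralized training can be illustrated with plain functions standing in for the networks. In this sketch each per-joint policy is any callable from a local observation to a scalar, which is then clipped to $[0, 1]$ ; the names `joint_action` and `critic_input` and the flat concatenation used for the critic input are hypothetical choices, not the paper's implementation.

```python
from typing import Callable, List, Sequence

Policy = Callable[[Sequence[float]], float]


def clip01(x: float) -> float:
    """Clamp a scalar to the normalised action range [0, 1]."""
    return min(1.0, max(0.0, x))


def joint_action(policies: List[Policy],
                 observations: List[Sequence[float]]) -> List[float]:
    """Decentralised execution: agent i maps its own observation o_i to a_i."""
    return [clip01(pi(o)) for pi, o in zip(policies, observations)]


def critic_input(observations: List[Sequence[float]],
                 actions: Sequence[float]) -> List[float]:
    """Centralised training: the critic sees every observation and every action."""
    flat_obs = [v for o in observations for v in o]
    return flat_obs + list(actions)
```

Each agent outputs a single scalar, so the six policies together span the product space of equation (25), while the critic input grows only linearly with the number of joints.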

Experimental design and analysis

Simulation parameters

To evaluate the performance of our proposed method, systematic experiments were carried out in a PyBullet-based simulation environment serving as the training platform for robotic manipulation. The experimental setup consisted of multiple UR5 robotic arms and target objects modeled as cylinders 5 cm in diameter and 3 cm in height, each positioned by its geometric centroid. To prevent interference with the robotic arm base positioned at the origin, target objects were placed within the coordinate range of [20, 80] along both the x and y axes. In the PyBullet environment, the robotic arm model is described using a URDF file, which includes a <collision> tag and the corresponding collision geometries. When a collision occurs, the system returns collision information (covering both internal collisions within the arm and collisions with the ground) through interfaces such as p.getContactPoints. After each simulation step, we check for collisions, and any detected collision is considered a task failure. Detailed parameters are listed in Table 2.

Table 2.

List of primary simulation parameters.

Parameter	Setting
Actor learning rate	$1 \times 10^{-5}$
Critic learning rate	$5 \times 10^{-3}$
Total training episodes	35,000
Steps per episode	100
Virtual boundary length ( $L_{bound}$ )	30 cm
Success distance threshold ( $L_{succ}$ )	2 cm
Success angle threshold ( $α_{succ}$ )	$\leq 10^{\circ}$
Target height offset ( $H_{succ}$ )	10 cm
Manipulator base coordinates	(0, 0)

CD-MADDPG experimental analysis

In this section, an experiment employing a single robotic arm and a target object is conducted. The results demonstrate the performance of algorithms in terms of success rate, reward, angle, and distance.

DDPG: This algorithm refers to the application of DDPG in a multi-agent environment for decoupled robotic manipulators.

MADDPG: The original MADDPG algorithm employs a CTDE approach to train individual agents.

DDPG-Single: The single-agent DDPG algorithm is applied to an environment with non-decoupled robotic manipulators, serving as the primary benchmark algorithm in this study.

CD-MADDPG: An enhanced MADDPG algorithm that integrates curiosity-driven prioritized experience replay (CD-PER).

Figure 5 illustrates the convergence of the success rate for the evaluated algorithms. CD-MADDPG achieves the best overall performance: its success rate peaks at approximately 13,000 episodes, then fluctuates at a high level between 80% and 100%, and ultimately converges to $96\% \pm 8\%$. MADDPG exhibits a steady learning trajectory, with its success rate gradually increasing and finally converging to $49\% \pm 13\%$. DDPG performs worst. Owing to its limitations in handling the non-stationarity of multi-agent environments, it shows no clear learning trend; its success rate fluctuates between 0% and 30% and converges to an average of $13\% \pm 8\%$. As the benchmark for performance comparison, DDPG-Single converges to a success rate of $77\% \pm 12\%$, so the proposed algorithm improves on it by 19 percentage points.

Figure 6 illustrates the convergence of the cumulative reward for the evaluated algorithms throughout training. During the initial rapid ascent phase, CD-MADDPG rises fastest, outperforming the other algorithms in both convergence speed and learning efficiency. In the steady ascent phase, DDPG yields the lowest reward, fluctuating between 40 and 50 after 20,000 episodes and ultimately converging to $46.14 \pm 3.26$. To facilitate a clearer comparison, an inset displays the steady-state rewards of the remaining three algorithms, which exhibit relatively minor fluctuations. The magnified view shows that CD-MADDPG converges to approximately $58.60 \pm 1.32$, followed by DDPG-Single at $56.70 \pm 0.68$ and the original MADDPG at $55.09 \pm 0.67$. These findings support the advantage of the proposed algorithm in accelerating convergence and improving steady-state performance.
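The success-rate comparison can be read in two ways, and the short sketch below separates them using only the converged values reported for Figure 5 (96% for CD-MADDPG and 77% for DDPG-Single). The function name `success_rate_gap` is illustrative and not part of the original work.

```python
from typing import Tuple


def success_rate_gap(proposed: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute gap in percentage points, relative gain in percent)."""
    absolute = proposed - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# Converged success rates reported in the text, in percent.
cd_maddpg, ddpg_single = 96.0, 77.0
gap_pp, gain_pct = success_rate_gap(cd_maddpg, ddpg_single)
print(f"absolute gap: {gap_pp:.0f} percentage points")  # absolute gap: 19 percentage points
print(f"relative gain: {gain_pct:.1f}%")                # relative gain: 24.7%
```

The reported figure of 19 corresponds to the absolute gap between the two converged success rates, not to the relative gain.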

Figure 6.

Convergence of reward for various algorithms.

Figure 7 illustrates the convergence of the minimum distance for the evaluated algorithms during single-arm manipulation tasks, a metric highly correlated with the convergence of the cumulative reward. In terms of the initial descent rate, CD-MADDPG outperforms the other three algorithms, followed by DDPG-Single, then MADDPG, with DDPG last. DDPG fails to learn effectively on this metric, converging to $11.33 \pm 2.90$ cm after an initial decline, primarily because of its limited ability to learn efficiently in multi-agent environments. The magnified inset details the steady-state performance: the minimum distance of CD-MADDPG stabilizes between 1.5 and 2.0 cm, ultimately converging to $1.45 \pm 0.47$ cm, closely followed by DDPG-Single at $2.59 \pm 0.41$ cm. The proposed algorithm thus achieves a 44% improvement in distance accuracy over DDPG-Single. Meanwhile, MADDPG fluctuates within the range of 4.0–5.0 cm, ultimately converging to $4.62 \pm 0.58$ cm. These differences mirror the discrepancies observed in reward convergence, further supporting the advantage of the proposed algorithm in positioning accuracy.

Figure 8 illustrates the convergence of the angular deviation between the end-effector vector and the normal vector of the $X$–$Y$ plane for the evaluated algorithms. DDPG exhibits the least favorable orientation control, with its angular error fluctuating significantly between $0^\circ$ and $20^\circ$ and ultimately converging to $7.59^\circ \pm 5.44^\circ$. In contrast, the magnified inset reveals that CD-MADDPG, DDPG-Single, and MADDPG show nearly identical convergence on the angular metric. All three fluctuate stably within $0^\circ$ to $2^\circ$, which satisfies the operational requirements for suction manipulation, and converge to $0.83^\circ \pm 0.68^\circ$, $0.87^\circ \pm 0.63^\circ$, and $0.88^\circ \pm 0.51^\circ$, respectively. These results suggest that CD-MADDPG, DDPG-Single, and MADDPG all prioritize the optimization of orientation accuracy. Re-examining the reward curves in Figure 6, it can be inferred that the gap in cumulative reward among these three algorithms primarily stems from their varying degrees of optimization of the minimum distance metric.

Figure 9 illustrates the convergence of the average end-effector height for the evaluated algorithms. CD-MADDPG, DDPG-Single, and MADDPG all converge considerably faster than DDPG on the height metric; all three reach approximately 10 cm (matching the reserved height requirement) around the 5,000th episode, with convergence values of $10.10 \pm 0.57$ cm, $10.46 \pm 0.59$ cm, and $9.07 \pm 0.79$ cm, respectively. This rapid convergence is consistent with their performance on the minimum distance metric. In contrast, DDPG lags notably, fluctuating between 10 and 15 cm after 20,000 episodes and ultimately converging to $15.43 \pm 3.32$ cm. This observation aligns with the convergence bottleneck in its reward curve during the same stage, indicating its difficulty in achieving further precision optimization.
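Two of the quantities discussed above are straightforward to reproduce, and the sketch below illustrates them under stated assumptions. The distance improvement is taken as the relative reduction of the converged minimum distance with respect to DDPG-Single. The angular error is taken as the angle between an end-effector direction vector and the $X$–$Y$ plane normal. The function names, the choice of $(0, 0, 1)$ as the normal, and the example vector are assumptions for illustration, not details taken from the original implementation.

```python
import math


def relative_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is smaller than `baseline`."""
    return 100.0 * (baseline - value) / baseline


def angle_to_normal_deg(direction, normal=(0.0, 0.0, 1.0)) -> float:
    """Angle in degrees between `direction` and `normal`.

    The sign convention of the normal is an assumption; 0 degrees means
    the end-effector vector is perfectly aligned with the normal.
    """
    dot = sum(d * n for d, n in zip(direction, normal))
    norm_d = math.sqrt(sum(d * d for d in direction))
    norm_n = math.sqrt(sum(n * n for n in normal))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_d * norm_n)))
    return math.degrees(math.acos(cos_theta))


# Converged minimum distances reported in the text, in cm.
distance_gain = relative_reduction(baseline=2.59, value=1.45)
print(f"distance reduction vs DDPG-Single: {distance_gain:.0f}%")

# Hypothetical end-effector direction tilted slightly away from the normal.
tilt = angle_to_normal_deg((0.0, 0.0145, 1.0))
print(f"tilt of example vector: {tilt:.2f} degrees")
```

With the reported values of 2.59 cm and 1.45 cm, the relative reduction comes to about 44%, which matches the improvement quoted for Figure 7.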

Figure 7.

Convergence of minimum distance for various algorithms.

Figure 8.

Convergence of the angular error for various algorithms.

Figure 9.

Convergence of height for various algorithms.

Conclusion

This paper addresses the positioning problem of a joint-decoupled robotic arm performing suction-based grasping and transportation tasks in environments with randomly generated target positions. By integrating a curiosity-driven experience replay mechanism, we propose the CD-MADDPG algorithm, which uses the prediction residuals of a forward dynamics model on state transitions to evaluate the exploration value of samples, and which decouples the robotic arm to reduce the dimensionality of the action space. Simulation results demonstrate that CD-MADDPG outperforms the baseline algorithms, achieving a 19-percentage-point improvement in success rate and a 44% improvement in distance accuracy over the non-decoupled single-agent counterpart (DDPG-Single). Future work will focus on deploying the proposed algorithm on physical robotic platforms and on more comprehensive experiments that introduce additional influencing factors, such as environmental and dynamic obstacles, to investigate its robustness in more complex environments.
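The core idea of the replay mechanism summarized above, scoring a stored transition by how poorly a forward dynamics model predicts its next state, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the class `CuriosityReplayBuffer`, the priority exponent `alpha`, the offset `eps`, and the toy forward model are all hypothetical. The exact priority formula, any importance-sampling correction, and the network architectures are not reproduced here.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Transition = Tuple[Sequence[float], Sequence[float], float, Sequence[float]]


class CuriosityReplayBuffer:
    """Replay buffer that samples in proportion to forward-model residuals."""

    def __init__(self, forward_model: Callable, capacity: int = 10000,
                 alpha: float = 0.6, eps: float = 1e-3):
        self.forward_model = forward_model  # hypothetical: (s, a) -> predicted s_next
        self.capacity = capacity
        self.alpha = alpha                  # assumed priority exponent
        self.eps = eps                      # keeps well-predicted samples reachable
        self.data: List[Transition] = []
        self.priorities: List[float] = []

    def residual(self, s, a, s_next) -> float:
        """Euclidean error between predicted and observed next state."""
        pred = self.forward_model(s, a)
        return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, s_next)))

    def add(self, s, a, r, s_next) -> None:
        if len(self.data) >= self.capacity:  # drop the oldest transition
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append((s, a, r, s_next))
        self.priorities.append((self.residual(s, a, s_next) + self.eps) ** self.alpha)

    def sample(self, batch_size: int, rng=random) -> List[Transition]:
        """Draw transitions with probability proportional to priority."""
        return rng.choices(self.data, weights=self.priorities, k=batch_size)


def toy_model(s, a):
    """Stand-in forward model: predicts next state as state plus action."""
    return [si + ai for si, ai in zip(s, a)]


buf = CuriosityReplayBuffer(toy_model)
buf.add([0.0, 0.0], [1.0, 0.0], 0.0, [1.0, 0.0])  # matches the model: low novelty
buf.add([0.0, 0.0], [1.0, 0.0], 0.0, [3.0, 4.0])  # large residual: high novelty
batch = buf.sample(4, rng=random.Random(0))
```

In this toy setup the second transition receives a much larger priority than the first, so it is drawn more often. That is the intended effect: samples from underexplored states are replayed more frequently.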

Footnotes

Author contributions

Conceptualization, Y.A. and W.W.; methodology, Y.A. and Z.S.; software, Z.L. and Y.A.; validation, Z.L., W.W. and Y.L.; formal analysis, W.W. and Z.S.; investigation, Y.A. and Z.L.; resources, Y.L. and Z.S.; data curation, Y.A. and Z.L.; writing—original draft preparation, Y.A. and W.W.; writing—review and editing, Y.L. and Z.S.; visualization, Z.L.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. and Z.S. All authors have read and agreed to the published version of the manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by the Natural Science Foundation of Fujian Province, China (Grant No. 2021J011112, 2023J011011); the Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) (Grant No. MJUKF-IPIC202406); the Science and Technology Bureau of Putian, Fujian Province, China (Grant No. 2022GZ2001ptxy11, 2024GZ3001PTXY19); the Startup Fund for Advanced Talents of Putian University (Grant No. 2024039); and the Postgraduate Research Project of Putian University (Grant No. yjs2024033).

ORCID iD

Yuanmo Lin

Declaration of conflicting interests

The authors declare that there is no conflict of interest with respect to the research, authorship, and/or publication of this article.
