Abstract
A robot demonstration method is proposed that combines locally weighted regression (LWR) with the Q-learning algorithm, and it is applied to a 6-DOF ball-hitting system. The method adapts to the work task by learning from demonstration and generating new actions. The LWR algorithm establishes the mapping between target values and actions. Based on the deviation of the landing position, a Q-learning algorithm then adjusts the manipulator parameters and compensates for the errors caused by the model and the controller. The LWR model fits a small local space to approximate the global state and decision space, which reduces the dimension and simplifies the training of Q-learning. As a result, the convergence rate and the precision of task execution are improved. Simulation and experiment demonstrate the applicability of the proposed method.
Introduction
Robots play an increasingly important role in daily life and in production, and they are expected to adapt to complex and changing environments. Robots need to handle problems arising from the environment and from human interaction, and to improve their adaptability when executing tasks. Imitation learning, also known as programming by demonstration (PbD), is a promising mechanism for learning new knowledge. By observing good-quality samples, it reduces the complexity of the learning search space, which leads to high learning efficiency [1]. It is also an important research direction in robotics [2–6].
The general form of imitation learning is action replication. Starting from the actions of the demonstrator, the control strategy is determined by solving a regression problem from the demonstration to the executed actions, and the imitator then applies this control strategy to reproduce the action. The traditional demonstration process divides the perceptual information into discrete sub-targets and point-to-point motor operations, so that the demonstration task is decomposed into a series of state-action-state transformations [7].
The demonstrated actions belong to a discrete state space. To improve the adaptability of the robot to the environment, PbD increasingly uses machine learning methods such as the Gaussian process method to solve the generalization problem of action learning [8, 9]. The Gaussian process method (GPM) is a kernel learning machine with a probabilistic meaning, which can give a probabilistic interpretation of the predicted output. It assumes that the observations and predictions follow a joint normal distribution; the posterior distribution of the predictions is then obtained from the covariance matrix of the observations and the inputs of the training set. GPM has been applied successfully to regression and classification problems [10, 11]. However, GPM has obvious drawbacks. Because its computational complexity is high, its computational efficiency and real-time performance are poor, especially when processing large amounts of data. In addition, GPM relies on the strict assumption of Gaussian noise, so it performs poorly when the actual model does not follow a Gaussian distribution.
The locally weighted linear regression (LWR) learning method is a data-driven learning algorithm that is widely used in robot adaptive control systems and performs well on complex control problems. LWR is a memory-based learning method. Its basic idea is to construct a local model from stored empirical data points, which are weighted by a distance function. The distance weighting ensures that the stored points nearer to the input point have a greater effect on the local model. LWR fits well within the training region. In addition, it depends little on the choice of features and can train a well-fitting model with only a simple linear model [12–14].
However, given the errors and uncertainties of the system and the model, a simple prediction model is not enough to predict the robot parameters, and other learning algorithms are needed to increase the robustness of the model. Some studies focus on self-learning methods such as policy gradient search and inverse reinforcement learning (IRL) [15–17]. IRL mainly seeks the optimal cost function while the imitator learns the demonstrated behavior, and then searches for the control strategy that minimizes this cost function. However, both policy gradient search and IRL need a large amount of training data, which consumes a great deal of time and labor. In robot programming by demonstration, Guenter used reinforcement learning to find the optimal parameters of a Gaussian mixture model (GMM), which showed that reinforcement learning can handle many different and unexpected situations [18]. Reinforcement learning addresses problems in which an autonomous agent observes the state of its environment and takes a series of actions to change that state; the agent achieves its goals by learning to choose the optimal action. With continued research, reinforcement learning has become increasingly mature in handling robot-environment interaction and motion prediction problems [19–23].
In this paper, to teach the manipulator the ball-hitting task by demonstration, the system is required to generalize new actions according to the newly observed target position (environment). A simple non-linear prediction model may not achieve the required accuracy because of the errors and uncertainty of the model and the controller. Reinforcement learning has the advantage of dealing with uncertain and unexpected situations. However, because of the huge sample space, its training time is long and its convergence is slow, so it is difficult to apply reinforcement learning directly to a real robot system. We propose a control method combining LWR and reinforcement learning, which achieves good precision and requires less training time. In the next section, related work on action learning and regeneration is summarized.
Related work
Traditionally, machine learning (ML) techniques have been used to solve the problem of generating new actions. Muench proposed ML techniques to generalize demonstrated actions and designed an adaptive controller to regenerate the task according to the new environment [8]. However, the learning system was simple and could not perform the execution autonomously. In a table tennis robot experiment based on a mixture of motor primitives, Gaussian process regression was used to establish the mapping between the robot parameters and the original characteristic of the motor, and a probabilistic model was given to predict the state parameters of the robot [9]. Because of the drawbacks of GPM, the performance of the model is poor in some cases.
In recent years, the LWR learning method has been widely used in robot adaptive control systems and performs well on complex control problems. In [12–14], LWR was used to establish the mapping between the environment and the robot parameters. In batting experiments, Matsushima established the mapping between the robot parameters and the motion of the ball, proposed an LWR-based method to generalize new actions, and then compensated the error of the servo system using iterative learning control. In [13], Schaal improved the traditional LWR so that it performs well in high-dimensional spaces, and verified the feasibility of the method on a 7-DOF manipulator. On a table tennis robot platform, Huang first used LWR as “lazy learning” and then added FCMAC to the LWR algorithm as “active learning”, which updates the experience database online and fine-tunes the manipulator parameters to improve the placement accuracy of the ball [14].
However, given the errors and uncertainties of the system and the model, some research has studied more complex learning methods. In [15], Peters discussed various policy gradient search methods. These methods are capable of self-learning and showed good performance when applied to demonstration-based baseball batting experiments. In [16, 17], an IRL method was proposed to search for control strategies in robot demonstration learning. For the physical human-robot collaboration problem, Ghadirzadeh proposed a data-efficient reinforcement learning framework in which the robot learns tasks from its own motion sensors in an unsupervised way [21]. For a grasping task based on tactile information, Chebotar divided the problem into two steps. In the first step, a grasp stability predictor was trained from the tactile information and the grasping outcome. In the second step, a reinforcement learning model was trained from the tactile feedback and the output of the predictor. The second grasp was adjusted accordingly, which greatly improved the grasping success rate [22]. In [23], Wang used Q-learning to deal with the fixed-point, multi-direction soccer ball kicking problem of a humanoid robot. The motion curves of the robot’s foot served as the action set, while the discretized ball landing regions were taken as the state set. The kicking task was regarded as a simple reinforcement learning problem in which “the action obtains the state and directly gets the final reward”. However, these methods require a large amount of training data, which consumes a great deal of time and labor. As a result, it is difficult to use them directly on a real robot.
Inspired by the approach in [22], we divide the problem into two steps. First, the mapping between the landing position set and the manipulator action parameter set is established using the LWR learning method. Then, a Q-learning algorithm is added to the LWR to adjust the manipulator parameters and compensate for the error caused by the model and the controller. In addition, this method reduces the dimension of the Q-learning sample space and decision space, which greatly reduces the Q-learning training time. The two algorithms complement each other and perform well in the robot ball-hitting demonstration experiment. Finally, the feasibility of the control method is verified by both simulation and experiment.
System description and modeling
As shown in Fig. 1, the ball-hitting robot consists of an experimental platform, a 6-DOF UR manipulator with a bat, and an industrial camera fixed above the platform. The optical axis of the camera is perpendicular to the experimental platform, and the ball is located at a fixed position on a hitting platform. The coordinate systems are also illustrated: {B} is the manipulator base coordinate system and {T} is the tool center point (TCP) coordinate system.

Fig. 1. Hitting ball system.
During the experiment, to simplify the model and reduce the number of variables, the pose parameters and the speed of the end-effector are selected as the robot parameters.
The equivalent rotation vector
Construct the equivalent rotation matrix:
Equivalent rotation angle:
Equivalent rotation axis vector:
Predefined pose parameters:
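The equivalent rotation angle and axis listed above can be computed from a rotation matrix with the standard axis-angle formulas. The following is a minimal sketch under that assumption; the function name `rotation_matrix_to_vector` is hypothetical and is not taken from the paper, and the case of a rotation angle near π is deliberately rejected rather than handled.

```python
import numpy as np

def rotation_matrix_to_vector(R):
    """Convert a 3x3 rotation matrix to an equivalent rotation vector (angle * axis)."""
    R = np.asarray(R, dtype=float)
    # Equivalent rotation angle from the trace; clipping guards against round-off.
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if np.isclose(theta, 0.0):
        return np.zeros(3)  # no rotation: the axis is undefined, the vector is zero
    if np.isclose(theta, np.pi):
        # sin(theta) vanishes here, so the formula below is not usable.
        raise ValueError("angle near pi needs a separate branch (not sketched here)")
    # Equivalent rotation axis from the skew-symmetric part of R.
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]]) / (2.0 * np.sin(theta))
    return theta * axis

# Example: a 90 degree rotation about the z axis.
Rz90 = np.array([[0.0, -1.0, 0.0],
                 [1.0,  0.0, 0.0],
                 [0.0,  0.0, 1.0]])
rot_vec = rotation_matrix_to_vector(Rz90)
```

For the example matrix, the rotation vector points along z and its length is the rotation angle in radians.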
{V} is the pixel plane coordinate system. The information collected by the industrial camera is shown in Fig. 2. In this experiment, the camera is not calibrated, and the pixel coordinate information is collected as the target observation set directly.

Fig. 2. Information collected from camera.
LWR learning algorithm
LWR is a memory-based learning method. Its basic idea is to construct a local model from stored empirical data points, which are weighted by a distance function. A polynomial model is then selected and fitted to the sample points by the least squares method (LSM). LWR combines local polynomial fitting with local weighting to approximate the function near each prediction point according to the positions of the surrounding data, and it is strongly robust.
Learning model parameters
From the system description above, the action parameters of the manipulator are selected as the pose parameters, the position parameters and the velocity of the TCP.

Fig. 3. Coordinate system of TCP.
Furthermore, the position of the ball is expressed by the coordinates of its center, which are collected by the industrial camera, and several more notations are introduced:
Where Dd and Da denote the desired landing position and the actual landing position, respectively, and Dv denotes the deviation between the desired and actual landing positions.
The landing position set
Training sets obtained by demonstration:
The relationship between input and output:
Where
Use MLSM to compute the parameters in LWR model. The cost function is:
Where
Where
Where τ is the bandwidth parameter, which controls the rate at which the weights decline with distance. When τ is large, most of the data are used to train the regression model; when τ is small, only a few local points are used.
Finally, the obtained parameters are expressed as:
The LWR method is described as:
LWR: Fit
Return
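The fitting procedure described above can be sketched as a weighted least squares problem solved once per query point. This is an illustrative sketch only: the Gaussian weight function, the linear local model with a bias term, and the name `lwr_predict` are assumptions, since the paper's own weight and cost expressions are not reproduced in this text.

```python
import numpy as np

def lwr_predict(X, Y, x_query, tau):
    """Predict the output at x_query with a locally weighted linear model."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Assumed Gaussian weights: they decay with squared distance, tau is the bandwidth.
    sq_dist = np.sum((X - x_query) ** 2, axis=1)
    weights = np.exp(-sq_dist / (2.0 * tau ** 2))
    # Add a bias column so the local model is affine.
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    W = np.diag(weights)
    # Weighted normal equations; lstsq avoids an explicit matrix inverse.
    A = X_aug.T @ W @ X_aug
    B = X_aug.T @ W @ Y
    theta = np.linalg.lstsq(A, B, rcond=None)[0]
    x_query_aug = np.concatenate([[1.0], x_query])
    return x_query_aug @ theta

# Example on exactly linear data y = 2x + 1.
X_demo = [[0.0], [1.0], [2.0], [3.0], [4.0]]
Y_demo = [[1.0], [3.0], [5.0], [7.0], [9.0]]
y_hat = lwr_predict(X_demo, Y_demo, [2.5], tau=1.0)
```

A small `tau` makes the fit depend almost entirely on the nearest stored points, while a large `tau` approaches an ordinary global least squares fit.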
The essence of the LWR method is to fit a local model around the input data point by assigning distance weights to all data points in the database. In the above section, the parameters of the model are determined using LSM. However, the matrix operations in this method are time-consuming, especially when the matrix dimension is large. In this paper, a local condition for LWR is defined to ensure local fitting and reduce the matrix dimension, as follows:
Set a distance threshold: a data point is considered a local model point only when its distance from the input point is less than the threshold. Set a quantity limit: only a few of the data points in the database nearest to the input data point are used.
Thus, when the target pixel position is observed, the LWR model can be used to compute the initial action parameters for the robot's hit. When the empirical data are insufficient, the LWR method provides only a rough solution, and the hitting result obtained with these action parameters may have a large error. In the next section, a Q-learning reinforcement learning model is trained to further correct the action parameters, with the landing position error as the input, so that the precision of the hitting result is improved.
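The two local conditions can be sketched as a filter applied to the database before the regression is solved. The function name `select_local_points` and the parameters `max_dist` and `max_points` are hypothetical, and the Euclidean distance is an assumption.

```python
import numpy as np

def select_local_points(X, x_query, max_dist, max_points):
    """Return indices of stored points that satisfy both local conditions."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X - np.asarray(x_query, dtype=float), axis=1)
    # Condition 1: keep only points closer than the distance threshold.
    inside = np.flatnonzero(dist < max_dist)
    # Condition 2: of those, keep at most max_points nearest neighbours.
    nearest_first = inside[np.argsort(dist[inside])]
    return nearest_first[:max_points]

# Example: five stored points on a line, query at the origin.
stored = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [0.5, 0.0], [10.0, 0.0]]
local_idx = select_local_points(stored, [0.0, 0.0], max_dist=2.0, max_points=2)
```

Only the selected rows then enter the weighted least squares fit, which keeps the matrices small regardless of the database size.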
Q-learning reinforcement learning
Reinforcement learning addresses problems in which an autonomous agent observes the state of its environment and takes a series of actions to change that state. The agent achieves its goals by learning to choose the optimal action. Each time the agent takes an action in its environment, the demonstrator provides reward or penalty information to indicate whether the resulting state is correct. The task of the agent is to learn, from these indirect and delayed rewards, a control strategy that selects the appropriate action from any initial state so as to maximize the cumulative reward over time, see Fig. 4.

Fig. 4. Reinforcement learning.
Q-learning, proposed by Watkins in 1989, is a reinforcement learning method similar to dynamic programming that learns the optimal action strategy by interacting with the environment. It treats the interaction between the agent and the environment as a Markov decision process (MDP): the current state of the agent and the chosen action determine a fixed state transition probability distribution, the next state, and an immediate reward. Q-learning defines an evaluation function Q(s, a) whose value is the maximum discounted cumulative reward obtainable starting from state s with action a. The evaluation function is learned, and the optimal strategy is selected according to it. The evaluation function is defined as follows:
Where r(s, a) is the immediate reward of action a in state s; γ is the discount factor; δ(s, a) is the state reached from s after taking action a. Each state-action pair (s, a) corresponds to a Q value, and during learning the action is selected according to the Q value. The expression of the optimal strategy is as follows:
The symbol
As long as the system is modeled as a deterministic Markov process, the reward function r is bounded, and the action-selection mechanism ensures that each state-action pair is visited infinitely often, the
As a model-free learning method, Q-learning can learn without a system model, which is convenient. However, to achieve good performance, Q-learning requires a large amount of empirical (sample) data, and its learning efficiency needs to be improved. When the state space and the decision space are large, training converges slowly.
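For a deterministic process, the quantities defined above are commonly combined in the textbook update Q(s, a) ← r(s, a) + γ max over a' of Q(δ(s, a), a'), with the greedy strategy choosing the action with the largest Q value. The sketch below applies this update to a toy chain of states; the environment, the function name `q_learning_chain` and all numeric settings are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def q_learning_chain(n_states=4, gamma=0.9, episodes=200, seed=0):
    """Tabular Q-learning on a deterministic chain; the last state is the goal."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, 2))  # action 0 = step left, action 1 = step right
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        while s != goal:
            a = int(rng.integers(2))  # uniform exploration visits every pair repeatedly
            s_next = max(s - 1, 0) if a == 0 else s + 1  # deterministic transition
            r = 1.0 if s_next == goal else 0.0           # reward only on reaching the goal
            # Deterministic Q-learning update.
            Q[s, a] = r + gamma * Q[s_next].max()
            s = s_next
    return Q

Q_chain = q_learning_chain()
greedy_policy = Q_chain[:-1].argmax(axis=1)  # best action in each non-goal state
```

Even in this four-state toy problem many episodes are spent revisiting the same pairs, which hints at why convergence slows down as the state and decision spaces grow.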
The structure of learning method
Based on the LWR learning method, the mapping between the set of landing positions and the set of manipulator action parameters is established. A local model is built to fit the input sample points, which gives the method the ability to generalize. LWR relies only on the stored empirical data points, and the amount of empirical data has a great influence on its performance. However, a multi-DOF manipulator has a high-dimensional parameter space, which means that a large amount of empirical data is needed to train the LWR model, and the learning efficiency decreases. Therefore, LWR is used only as a rough estimate when the empirical data are insufficient.
Q-learning can increase the local generalization ability because it further corrects the manipulator parameters according to the actual error after hitting. Reinforcement learning, which divides the state space and action space into small parts, is expected to enhance the control accuracy of the learning system and compensate for the uncertainty and error of the model and the system. When reinforcement learning is used alone, the action and decision spaces suffer from the curse of dimensionality. Thus, we propose a learning method combining LWR and Q-learning: LWR is first used to fit a small local model, and Q-learning is then applied within this local model, which greatly reduces the dimension of the state and decision space while keeping the same division accuracy. As a result, the number of training trials decreases, and the learning efficiency and convergence rate of Q-learning are improved. Figure 5 shows the architecture of the learning method.

Fig. 5. Architecture of the learning method.
When the target position is given, the desired landing position of the ball is obtained from the coordinate information. Taking these parameters as the input of the controller, the optimal initial manipulator parameters are predicted by the LWR model.
Then the first hit is performed, and the error between the actual and desired landing positions observed by the camera is taken as the input to the Q-learning model, which outputs the adjustment parameters of the manipulator. Finally, the action parameters of the second hit are the sum of the initial parameters and the adjustment parameters, so that the manipulator performs a more accurate hit.
In this paper, the LWR method is first used to make a rough estimate. It is then assumed that the state transition function δ(s, a) of each point near the landing point obeys the same probability distribution, where s represents the landing position error and a represents the correction parameters of the manipulator action. The state space and the decision space are reduced from the whole original set of ball landing points and manipulator actions to the small local set of points near the landing point and the corresponding manipulator action set, which greatly reduces the dimension of the state and decision space. The convergence time of training decreases as well.
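The second-hit computation described here amounts to a table lookup followed by a sum. The sketch below assumes a trained Q table indexed by discretized landing-error state and a list of candidate adjustment vectors; `second_hit_parameters`, `q_table` and `adjustments` are hypothetical names and the numbers are made up for illustration.

```python
import numpy as np

def second_hit_parameters(initial_params, q_table, adjustments, state):
    """Add the best-rated adjustment for the observed error state to the LWR output."""
    best_action = int(np.argmax(q_table[state]))  # greedy choice from the learned Q values
    correction = np.asarray(adjustments[best_action], dtype=float)
    return np.asarray(initial_params, dtype=float) + correction

# Example with two error states and two candidate adjustments.
q_table_demo = np.array([[0.1, 0.9],
                         [0.5, 0.2]])
adjustments_demo = [[0.0, 0.0], [0.1, -0.2]]
params_second = second_hit_parameters([1.0, 2.0], q_table_demo, adjustments_demo, state=0)
```

The LWR prediction is left untouched; only the additive correction depends on the outcome of the first hit.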
For convenience, the state set
Where Vr is the reward value when the deviation of the landing position equals 0; r is the boundary of the state region; λ is a sufficiently large parameter which guarantees that e^(-λΔd) is close to 0 when Δd equals r; Δd is the deviation of the landing position. The exponential reward function gives a significantly higher reward in the area near the target point and a very small reward in the area away from it. Compared with a linear reward function, it more strongly emphasizes the good training samples near the target location and suppresses the disturbance of relatively poor training samples far from it. In this way, the training result converges quickly to the actual Q function, and the convergence rate is improved. Algorithm 1 gives the Q-learning-based algorithm for adjusting the ball-hitting parameters.
The algorithm is explained as follows:
In Step 1, the action is selected with uniform probability. In Steps 4 and 5, ɛ-greedy heuristics are used, where ɛ = 0.01. Since the ball-hitting task is a simple single-step reinforcement learning model in which “actions obtain status and directly get the final reward,” and does not involve a multi-step Markov decision process, the value in Step 6 does not require iteration over other states. In Step 9, when the execution of state In Step 12, the parameter ξ ∈ R+ represents the minimum Q value at which the training of a state ends.
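The contrast between the exponential and linear rewards can be illustrated numerically. The form Vr · e^(-λΔd) used below is an assumption inferred from the symbol definitions, and the linear alternative, the function names and the values of `v_r`, `lam` and `r` are chosen only for illustration.

```python
import math

def exponential_reward(delta_d, v_r=100.0, lam=1.0):
    """Assumed exponential reward: v_r at zero deviation, decaying with delta_d."""
    return v_r * math.exp(-lam * delta_d)

def linear_reward(delta_d, v_r=100.0, r=12.0):
    """Linear alternative: v_r at zero deviation, falling to 0 at the boundary r."""
    return v_r * max(0.0, 1.0 - delta_d / r)

# Rewards at a few deviations (mm) inside a region with boundary r = 12 mm.
for d in (0.0, 3.0, 12.0):
    print(d, exponential_reward(d), linear_reward(d))
```

With these settings both rewards equal `v_r` at zero deviation, but the exponential reward has already dropped to a small fraction of `v_r` at 3 mm, so samples far from the target contribute little to the learned values.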
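Two ingredients mentioned in these explanations, ɛ-greedy selection and a single-step value update without a successor-state term, can be sketched as follows. This is not a reconstruction of Algorithm 1, whose steps are not reproduced in this text; `epsilon_greedy` and `single_step_update` are hypothetical helper names.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def single_step_update(Q, state, action, reward):
    """Single-step task: the episode ends after one hit, so no gamma * max Q(s') term."""
    Q[state, action] = reward

rng = np.random.default_rng(0)
Q_single = np.zeros((2, 3))
greedy_pick = epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0, rng=rng)   # always greedy
random_pick = epsilon_greedy([0.1, 0.7, 0.2], epsilon=1.0, rng=rng)   # always random
single_step_update(Q_single, state=1, action=2, reward=5.0)
```

With ɛ = 0.01 as stated in the text, about one selection in a hundred would be exploratory and the rest greedy.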
The description of the hitting task experiment
The task of the experiment is to hit the ball to a target position on the platform with a bat mounted on the end of the manipulator, where the target position is arbitrary within the camera's field of view. At the beginning of the experiment, the manipulator is demonstrated hitting the fixed-position ball with different poses and hitting speeds, and the landing position is observed by the industrial camera. The action parameter set and the target observation set form the training set, and the LWR method is used to establish the mapping between the two sets. During the experiment, the UR manipulator performs the hitting task using the action parameters predicted by the LWR method, and the landing position is obtained by the industrial camera. From the error between the actual observation and the expected result, the Q-learning model is trained, and the action parameters of the manipulator are further adjusted to improve the accuracy of the second hit.
In this paper, a simulation control system is set up in ADAMS to verify that the method is feasible and effective. Then, an experiment on a platform built around the UR robot is performed to test the practical effect of the method.
Simulation of the control method
The training of LWR model
The LWR model is trained first. The ball is placed at a fixed position. The manipulator is demonstrated hitting the ball from 19 different directions, ranging from -45° to +45° in steps of 5°, see Fig. 6. In each hitting direction, the velocity of the bat is adjusted within the range from 400 mm/s to 1500 mm/s so that the landing position set contains 18 points per direction. The data of each effective hit are recorded, including the landing position

Fig. 6. Hit the ball from 19 different directions.
To further improve the accuracy of the hitting task, a reinforcement learning model is trained to output adjusted action parameters, with the first-hit deviation of the landing position as the input. The point (950, 0) is selected as the reference point. Taking this point as the center, a local rectangular region (length 30 mm, width 24 mm) is expanded and divided into 15 small rectangles (length 6 mm, width 8 mm). In this region, training is performed according to Algorithm 1. The training process is shown in Fig. 7: the higher the gray value, the higher the Q value of the corresponding region. A black area indicates that the Q value is 0, while a white area indicates that the Q value has reached the convergence condition.
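The division of the local region into 15 cells can be sketched as a mapping from a landing deviation to a state index. The sketch assumes that the 30 mm length lies along x and the 24 mm width along y, that deviations are measured from the region center, and that cells are numbered row by row; the function name `deviation_to_state` is hypothetical.

```python
def deviation_to_state(dx, dy, length=30.0, width=24.0, cell_l=6.0, cell_w=8.0):
    """Map a landing deviation in mm (relative to the region center) to a cell index."""
    n_cols = round(length / cell_l)  # 5 cells along the length
    # Shift the origin to the lower-left corner of the region.
    x = dx + length / 2.0
    y = dy + width / 2.0
    if not (0.0 <= x < length and 0.0 <= y < width):
        return None  # outside the local region covered by the Q table
    col = int(x // cell_l)
    row = int(y // cell_w)
    return row * n_cols + col

center_state = deviation_to_state(0.0, 0.0)
corner_state = deviation_to_state(-15.0, -12.0)
outside_state = deviation_to_state(20.0, 0.0)
```

With 5 columns and 3 rows the indices run from 0 to 14, and a zero deviation falls in the middle cell, index 7.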

Fig. 7. Machine learning training process.
It can be seen from Fig. 7 that, using the Q-learning method, the robot updates the Q values in the Q matrix through learning, trial and reward, and thereby obtains an optimal strategy set. This set determines the current state from the deviation of the first landing position and selects the optimal action, that is, the adjusted action parameters, to perform the second hit. After nearly 650 training trials, the Q value in each state of the Q matrix converges to its maximum value, and the training is complete. Compared with the method in [23], which applied the Q-learning algorithm to the whole action and decision space, the number of training trials of the reinforcement learning model is significantly reduced in our method.
After the model training, 58 target positions are randomly selected in the working range of the system, and two hits are performed for each. The first hit is based on the LWR model alone, and the second hit is based on the LWR model with Q-learning. For comparison, GPM is used to predict the robot parameters from the same dataset. The error analysis is shown in Fig. 8.
As shown in Fig. 8, the horizontal coordinate is the hit sequence and the vertical coordinate is the landing position error. The green solid line represents the hitting error based on GPM. In 58 hits, the maximum error is 47.94 mm, the minimum error is 8.33 mm, and the average error is 23.59 mm. The variance of the error is 8.49, which indicates that the error fluctuates strongly and that the predictions are poor in some situations. The reason may be that the observation data do not strictly follow a multivariate Gaussian distribution or that the initial parameters are not reasonable. The blue solid line represents the hitting error based on the LWR model. In 58 hits, the maximum error is 16.12 mm, the minimum error is 3.61 mm, and the average error is 8.00 mm. The variance of the error is 2.24, which is much smaller than that of GPM. The figure also shows that the LWR error is generally smaller than the GPM error. On the same dataset, LWR performs better than GPM in our ball-hitting simulation.

Fig. 8. Error analysis.
The red solid line is the hitting error of the LWR model with Q-learning. In 58 hits, the maximum error is 6.13 mm, the minimum error is 0.24 mm, and the average error is 2.74 mm. It can be seen from the figure that, in the same sequence of hitting simulations, the LWR model with Q-learning has the smallest deviation, and the hitting precision is improved. The overall average deviation of the landing position is also smaller than with the LWR method alone (2.74 mm versus 8.00 mm).
The hitting task experiment is then performed on the platform built around the UR robot. First, 12 target points are randomly selected within the training range of the LWR model, and the action parameters of the robot are calculated by LWR and by LWR with Q-learning, respectively. The landing positions and the calculated manipulator parameters are shown in Tables 1 and 2.
Algorithm 1
Target landing position and Reality landing position
Target landing position and Reality landing position
Parameters of manipulator
According to the tables, using LWR alone, the maximum error is 22.6 mm, the minimum error is 17.1 mm, and the average error is 19.1 mm. Using LWR with Q-learning, the maximum error is 11.2 mm, the minimum error is 5.6 mm, and the average error is 7.6 mm. LWR with Q-learning therefore has smaller errors and performs better. The distribution of the results is shown in Fig. 9.

Fig. 9. The distribution of results.
In Fig. 9, the green points indicate the target landing positions. The red points indicate the first-hit landing positions obtained with the LWR model, and the blue points indicate the second-hit landing positions obtained with LWR and Q-learning. The LWR model trains well, but its precision can still be improved. In each hit, the blue point is closer to the target point than the red one. The accuracy of the second hit is therefore significantly improved, which shows that the proposed method is feasible and effective.
A robot demonstration method based on LWR and the Q-learning reinforcement learning algorithm is proposed. The method adapts to the work task by learning the demonstrated actions and generating new actions. It consists of two parts. The first part is a multivariable locally weighted regression model, which establishes the mapping between the target value and the action. The second part is a Q-learning model, which outputs the adjustment parameters of the action using the error of the first step as input. Combining LWR and Q-learning improves the accuracy of the task. In addition, the regression model fits a small local space to approximate the global state and decision space, which reduces the dimension and the number of reinforcement learning training trials, and improves the convergence rate. The feasibility and effectiveness of the proposed method are verified by simulation in ADAMS, and the validity of the algorithm is verified by a real ball-hitting experiment on a 6-DOF UR robot.
Future work
In this paper, our work is limited to a static initial state in which the ball is located at a fixed position and the hitting pattern is predefined. In future work, we will further verify and enhance the adaptability of the method so that the manipulator can hit a moving ball and learn new hitting patterns. In addition, the proposed model considers only the kinematic model of the robot, and dynamics-related parameters are not taken into account. The method may become more robust if a force sensor is installed on the end-effector, which would help collect parameter information related to the dynamic model.
Acknowledgments
This work was supported by the Science Fund Project No. 61462089 of China’s NSF, the Scientific and Technological Research Program of Chongqing Municipal Education Commission No. KJ1501330, and the Chinese MIIT Intelligent Manufacturing and New Mode Application project “Application of new mode of intelligent manufacturing of Chinese medicine products”.
