Search results
1 – 10 of over 1000
Abstract
Purpose
Many practical control problems require achieving multiple objectives, and these objectives often conflict with each other. Existing multi-objective evolutionary reinforcement learning algorithms achieve poor search results on such problems, so it is necessary to design a new multi-objective evolutionary reinforcement learning algorithm with stronger search capability.
Design/methodology/approach
The multi-objective reinforcement learning algorithm proposed in this paper is based on an evolutionary computation framework. In each generation, a long-short-term selection method chooses the parent policies. Long-term selection is based on each policy's improvement along its predefined optimization direction in the previous generation; short-term selection uses a prediction model to predict the optimization direction likely to yield the greatest improvement in overall population performance. In the evolutionary stage, a penalty-based nonlinear scalarization method scalarizes the multi-dimensional advantage functions, and a nonlinear multi-objective policy gradient is designed to optimize the parent policies along the predefined directions.
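The penalty-based scalarization of a multi-dimensional advantage along a predefined direction might look like the following sketch, where the projection-plus-quadratic-penalty form and the `penalty` coefficient are illustrative assumptions rather than the paper's exact formulation:

```python
import numpy as np

def penalized_scalarize(advantages, direction, penalty=10.0):
    """Scalarize a vector of per-objective advantages: reward the component
    along the predefined optimization direction and nonlinearly penalize
    deviation from it (hypothetical form, not the paper's exact formula)."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)              # unit optimization direction
    a = np.asarray(advantages, dtype=float)
    parallel = float(a @ d)                # improvement along the direction
    orthogonal = a - parallel * d          # deviation from the direction
    return parallel - penalty * float(np.linalg.norm(orthogonal)) ** 2
```

A policy whose advantage vector points along its assigned direction then scores higher than one of similar magnitude that drifts off-direction, which is what forces improvement along the predefined directions.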
Findings
The penalty-based nonlinear scalarization method can force policies to improve along the predefined optimization directions. The long-short-term optimization method can alleviate the exploration-exploitation problem, enabling the algorithm to explore unknown regions while ensuring that potential policies are fully optimized. The combination of these designs can effectively improve the performance of the final population.
Originality/value
A multi-objective evolutionary reinforcement learning algorithm with stronger search capability is proposed. It can find a Pareto policy set with better convergence, diversity and density.
Hamed Shahbazi, Kamal Jamshidi and Amir Hasan Monadjemi
Abstract
Purpose
The purpose of this paper is to model a motor region named the mesencephalic locomotor region (MLR), which is located at the end part of the brain and the first part of the spinal cord. This model will be used for a Nao soccer-playing humanoid robot. It consists of three main parts: a High Level Decision Unit (HLDU), an MLR-Learner and the CPG layer. The authors focus on a special type of decision making named curvilinear walking.
Design/methodology/approach
The authors' model is based on stimulation of some programmable central pattern generators (PCPGs) to generate curvilinear bipedal walking patterns. The PCPGs are made from adaptive Hopf oscillators. The high-level decision, i.e. curvilinear bipedal walking, is formulated as a policy-gradient learning problem over some free parameters of the robot's CPG controller.
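A single Hopf oscillator of the kind such PCPGs are built from can be sketched as below; the amplitude parameter `mu`, frequency `omega` and Euler step `dt` are generic assumptions, not the authors' robot-specific values:

```python
import numpy as np

def hopf_step(x, y, mu=1.0, omega=2.0 * np.pi, dt=0.001):
    """One Euler step of a Hopf oscillator: trajectories converge to a
    stable limit cycle of radius sqrt(mu) at angular frequency omega."""
    r2 = x * x + y * y
    dx = (mu - r2) * x - omega * y
    dy = (mu - r2) * y + omega * x
    return x + dt * dx, y + dt * dy

# starting near the origin, the state settles onto the unit circle
x, y = 0.1, 0.0
for _ in range(20000):
    x, y = hopf_step(x, y)
```

The x (or y) component then provides a smooth rhythmic signal that can drive joint trajectories; coupling several such units with phase offsets is the usual way to form a walking pattern.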
Findings
The paper provides a basic model for generating different types of motion in humanoid robots using only simple stimulation of a CPG layer. A suitable and fast curvilinear walk, similar to ordinary human walking, has been achieved on a Nao humanoid robot. This model can be extended and used in other types of humanoid robot.
Research limitations/implications
The authors' work is limited to a special type of biped locomotion. Other types of motion remain to be tested and evaluated with this model.
Practical implications
The paper introduces a bio-inspired model of skill learning for humanoid robots. It is used to generate curvilinear bipedal walking patterns, a beneficial movement for soccer-playing Nao robots in RoboCup competitions.
Originality/value
The paper uses a new biological motor concept in artificial humanoid robots, which is the mesencephalic locomotor region.
Abstract
Purpose
English original movies play an important role in English learning and communication. To help users find the movies they need among a large number of English original movies and reviews, this paper proposes an improved deep reinforcement learning algorithm for movie recommendation. Although conventional movie recommendation algorithms solve the problem of information overload, they still have limitations in the cases of cold start and sparse data.
Design/methodology/approach
To solve the aforementioned problems of conventional movie recommendation algorithms, this paper proposes a recommendation algorithm based on deep reinforcement learning, which uses the deep deterministic policy gradient (DDPG) algorithm to address the cold-start and sparse-data problems and uses Item2vec to transform the discrete action space into a continuous one. Meanwhile, a reward function combining cosine distance and Euclidean distance is proposed to keep the neural network from converging to a local optimum prematurely.
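A reward mixing cosine similarity with a squashed Euclidean-distance term might be sketched as follows; the mixing weight `alpha` and the `1/(1+d)` squashing are assumptions for illustration, not the paper's exact definition:

```python
import numpy as np

def reward(action_vec, item_vec, alpha=0.5):
    """Hypothetical reward between the policy's action embedding and a
    target item embedding: high when the vectors point the same way
    (cosine term) and are close in space (Euclidean term)."""
    a = np.asarray(action_vec, dtype=float)
    b = np.asarray(item_vec, dtype=float)
    cos = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    euc = 1.0 / (1.0 + float(np.linalg.norm(a - b)))  # 1.0 when identical
    return alpha * cos + (1.0 - alpha) * euc
```

Combining the two terms rewards both the direction and the position of the action embedding, giving a denser signal than either distance alone.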
Findings
To verify the feasibility and validity of the proposed algorithm, it is compared with state-of-the-art algorithms in terms of RMSE, recall rate and accuracy in experiments on the MovieLens English original movie data set. Experimental results show that the proposed algorithm is superior to the conventional algorithms on all indicators.
Originality/value
Applied to the recommendation of English original movies, the DDPG policy produces better recommendation results and alleviates the impact of cold start and sparse data.
Yanbiao Zou and Hengchang Zhou
Abstract
Purpose
This paper aims to propose a weld seam tracking method based on proximal policy optimization (PPO).
Design/methodology/approach
A neural network based on PPO is constructed, with the reference image block and the image block to be detected as its dual-channel input; the network predicts the translation between the two images and corrects the location of feature points in the weld image. A localization accuracy estimation network (LAE-Net) is built to update the reference image block during the welding process, which helps reduce the tracking error.
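The dual-channel input itself is just the two image blocks stacked along the channel axis; a minimal sketch (shapes and channel-first layout assumed, the PPO network itself not reproduced):

```python
import numpy as np

def make_dual_channel(ref_block, det_block):
    """Stack the reference block and the block to be detected into one
    two-channel array, the form a registration network would consume."""
    ref = np.asarray(ref_block, dtype=np.float32)
    det = np.asarray(det_block, dtype=np.float32)
    assert ref.shape == det.shape, "both blocks must have the same size"
    return np.stack([ref, det], axis=0)    # shape: (2, H, W)

def correct_feature_point(point, predicted_shift):
    """Apply the network's predicted translation to a feature point."""
    return (point[0] + predicted_shift[0], point[1] + predicted_shift[1])
```

Once the network predicts the (dx, dy) translation between the two channels, correcting a feature point is a plain coordinate shift, as in `correct_feature_point`.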
Findings
Off-line simulation results show that the proposed algorithm is robust and performs well on a test set of curved-seam images with strong noise. In the welding experiment, the movement of the welding torch is stable, the molten material is uniform and smooth, and the welding error is small enough to meet the requirements of industrial production.
Originality/value
The idea of image registration is applied to weld seam tracking, and the tracking network is built on the basis of PPO. To further improve tracking accuracy, the LAE-Net is constructed so that the reference images can be updated.
Yangmin Xie, Qiaoni Yang, Rui Zhou, Zhiyan Cao and Hang Shi
Abstract
Purpose
Fast obstacle avoidance path planning is a challenging task for multijoint robots navigating through cluttered workspaces. This paper aims to address this issue by proposing an improved path-planning method based on the distorted space (DS) method, specifically designed for high-dimensional complex environments.
Design/methodology/approach
The proposed method, termed topology-preserved distorted space (TP-DS) method, mitigates the limitations of the original DS method by preserving space topology through elastic deformation. By applying distinct spring constants, the TP-DS autonomously shrinks obstacles to microscopic areas within the configuration space, maintaining consistent topology. This enhancement extends the application scope of the DS method to handle complex environments effectively.
Findings
Comparative analysis demonstrates that the proposed TP-DS method outperforms traditional methods in planning efficiency. Successful obstacle-avoidance tasks in a cluttered workspace validate its applicability on a physical 6-DOF manipulator, highlighting its potential for industrial implementation.
Originality/value
The novel TP-DS method generates a topology-preserved collision-free space by leveraging elastic deformation and shows significant capability and efficiency in planning obstacle-avoidance paths in complex application scenarios.
Hongbo Zhu, Minzhou Luo, Jianghai Zhao and Tao Li
Abstract
Purpose
The purpose of this paper is to present a soft-landing control strategy for a biped robot to avoid and absorb the impulsive reaction forces (which weaken walking stability) caused by the landing impact between the swing foot and the ground.
Design/methodology/approach
First, a suitable trajectory of the swing foot is preplanned to avoid impulsive reaction forces in the walking direction. Second, the impulsive reaction forces of the landing impact are suppressed by on-line trajectory modification based on extended time-domain passivity control with admittance causality, which takes the reaction forces as inputs and outputs the decomposed swing-foot positions that trim off those forces.
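Admittance causality of this kind — reaction force in, position correction out — can be sketched for one axis as a virtual mass-damper-spring; the gains `M`, `B`, `K` and step `dt` are illustrative, and the paper's extended time-domain passivity logic (which adapts such behavior on-line) is not reproduced:

```python
def admittance_step(f_ext, z, dz, M=5.0, B=50.0, K=200.0, dt=0.002):
    """One step of M*z'' + B*z' + K*z = f_ext: the measured landing force
    drives a displacement z that trims the swing foot's planned position."""
    ddz = (f_ext - B * dz - K * z) / M
    dz = dz + dt * ddz                     # semi-implicit Euler
    z = z + dt * dz
    return z, dz

# a constant 20 N landing force yields a steady offset of f/K = 0.1
z, dz = 0.0, 0.0
for _ in range(2000):
    z, dz = admittance_step(20.0, z, dz)
```

The damper term is what absorbs the impulsive component: a sharp force spike produces a compliant retreat of the foot rather than a rigid rebound.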
Findings
The experimental data and results are described and analyzed, showing that the proposed soft-landing control strategy can suppress the impulsive forces and improve walking stability.
Originality/value
The main contribution is that a soft landing control strategy for a biped robot was proposed to deal with the impulsive reaction forces generated by the landing impact, which enhances walking stability.
Lei Yang, James Dankert and Jennie Si
Abstract
Purpose
The purpose of this paper is to develop a mathematical framework that addresses some algorithmic features of approximate dynamic programming (ADP) by using an average-cost formulation based on the concepts of differential costs and performance gradients. Under this framework, a modified value iteration algorithm is developed that is easy to implement and, at the same time, can address a class of partially observable Markov decision processes (POMDPs).
Design/methodology/approach
Gradient-based policy iteration (GBPI) is a top-down, system-theoretic approach to dynamic optimization with performance guarantees. In this paper, a bottom-up, algorithmic view is provided to complement the original high-level development of GBPI. A modified value iteration is introduced, which can solve the same type of POMDP problems dealt with by GBPI. Numerical simulations on a queuing problem and a maze problem illustrate and verify features of the proposed algorithms as compared to GBPI.
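The differential-cost idea can be illustrated with textbook relative value iteration for an average-cost MDP; this is a generic sketch of the formulation the paper starts from, not its modified, POMDP-capable algorithm:

```python
import numpy as np

def relative_value_iteration(P, C, iters=500, ref=0):
    """P[a]: transition matrix under action a; C[a]: per-state cost.
    Subtracting the value at a reference state keeps the iterates bounded
    and returns the average cost (gain) plus differential costs h."""
    n = P[0].shape[0]
    h = np.zeros(n)
    gain = 0.0
    for _ in range(iters):
        q = np.stack([C[a] + P[a] @ h for a in range(len(P))])
        h_new = q.min(axis=0)          # greedy Bellman backup
        gain = h_new[ref]              # estimate of the average cost
        h = h_new - gain               # differential (relative) costs
    return gain, h

# two states, two actions: "stay" (costs 1 and 2) vs "switch" (cost 0.5);
# alternating between the states is optimal, with average cost 0.5
P = [np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])]
C = [np.array([1.0, 2.0]), np.array([0.5, 0.5])]
gain, h = relative_value_iteration(P, C)
```

The differential costs `h` play the role of the relative-value terms the paper's performance-gradient analysis is built around.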
Findings
The direct connection between GBPI and policy iteration is shown under a Markov decision process formulation, yielding additional analytical insight into GBPI. Furthermore, motivated by this analytical framework, the authors propose a modified value iteration as an alternative for addressing the same POMDP problems handled by GBPI.
Originality/value
Several important insights gained from the analytical framework motivate the development of both algorithms. Building on this paradigm, new ADP learning algorithms can be developed, in this case the modified value iteration, to address a broader class of problems, the POMDPs. In addition, it is now possible to give ADP algorithms a gradient perspective. Inspired by this fundamental understanding of learning and optimization problems under the gradient-based framework, additional insight may be developed for bottom-up algorithms with performance guarantees.
Xinwang Li, Juliang Xiao, Wei Zhao, Haitao Liu and Guodong Wang
Abstract
Purpose
As complex analysis of contact models is required in traditional assembly strategies, it remains a challenge for a robot to complete multiple peg-in-hole assembly tasks autonomously. To enable the robot to complete assembly tasks autonomously and more efficiently with strategies learned by reinforcement learning (RL), this paper proposes a learning-accelerated deep deterministic policy gradient (LADDPG) algorithm.
Design/methodology/approach
The multiple peg-in-hole assembly strategy is designed in two modules: a high-level planning module and a low-level control module. The high-level module is implemented by the LADDPG agent, which derives high-level commands, that is, the desired contact force, from geometric and environmental constraints. The low-level control module drives the robot to complete the compliant assembly task through an adaptive impedance algorithm, according to the commands issued by the high-level module. In addition, a set of safety assurance mechanisms is developed so that a collaborative robot can be trained safely to learn autonomously.
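The division of labor — an agent issuing a desired contact force, a compliant loop tracking it — can be sketched with a toy force-tracking position update; the gain `kf`, the linear contact model and the single-axis setup are assumptions for illustration, not the LADDPG agent or the paper's adaptive impedance law:

```python
def impedance_position_update(x_cmd, f_desired, f_measured, kf=0.0005):
    """Low-level loop: nudge the commanded insertion depth until the
    measured contact force matches the force the agent asked for."""
    return x_cmd + kf * (f_desired - f_measured)

# environment modeled as a linear spring: f = stiffness * x
stiffness = 1000.0
x = 0.0
for _ in range(200):
    f_meas = stiffness * x
    x = impedance_position_update(x, f_desired=5.0, f_measured=f_meas)
# x converges toward f_desired / stiffness = 0.005
```

Because the robot regulates force rather than position, a small misalignment changes the measured force and the commanded depth backs off, which is what keeps peak contact forces low during insertion.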
Findings
The method completes the assembly tasks well through RL and achieves satisfactory compliance of the robot with the environment. Compared with the original DDPG algorithm, the average values of the instantaneous maximum contact force and contact torque during assembly are reduced by approximately 38% and 74%, respectively.
Practical implications
The entire algorithm can also be applied to other robots, and the assembly strategy can be applied in the field of automatic assembly.
Originality/value
A compliant assembly strategy based on the LADDPG algorithm is proposed to complete the automated multiple peg-in-hole assembly tasks.
Ke Xu, Fengge Wu and Junsuo Zhao
Abstract
Purpose
Recently, deep reinforcement learning has been developing rapidly and has shown its power on difficult problems such as robotics and the game of Go. Meanwhile, satellite attitude control systems still use classical control techniques such as proportional-integral-derivative and sliding mode control as their major solutions, and face problems with adaptability and automation.
Design/methodology/approach
In this paper, an approach based on deep reinforcement learning is proposed to increase the adaptability and autonomy of satellite control systems. It is a model-based algorithm that can find solutions with fewer episodes of learning than model-free algorithms.
Findings
Simulation experiments show that when classical control fails, this approach can find a solution and reach the target within hundreds of episodes of exploration and learning.
Originality/value
This approach is a non-gradient method that uses heuristic search to optimize the policy and avoid local optima. Compared with classical control techniques, it needs no prior knowledge of the satellite or its orbit, can adapt to different situations by learning from data, and can adapt to different satellites and tasks through transfer learning.
Tao Pang, Wenwen Xiao, Yilin Liu, Tao Wang, Jie Liu and Mingke Gao
Abstract
Purpose
This paper aims to study how an agent can learn from expert demonstration data while incorporating reinforcement learning (RL), which enables the agent to break through the limitations of the demonstration data and reduces the dimensionality of the agent's exploration space to speed up training convergence.
Design/methodology/approach
First, a decay weight function is set in the objective function of the agent's training to combine the two types of methods, so that both RL and imitation learning (IL) guide the agent's behavior when the policy is updated. Second, this study designs a coupled utilization method for the demonstration trajectories and the training experience, so that samples from both sources can be combined during the agent's learning, improving both data utilization and the agent's learning speed.
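The decay-weight combination of the IL and RL terms might be sketched as below; the exponential schedule and its rate are assumptions standing in for the paper's unspecified decay function:

```python
import math

def combined_loss(rl_loss, il_loss, step, decay_rate=1e-3):
    """Blend imitation and reinforcement objectives: the IL weight starts
    at 1 and decays toward 0, so RL gradually takes over the update."""
    w = math.exp(-decay_rate * step)
    return w * il_loss + (1.0 - w) * rl_loss
```

Early updates are dominated by the demonstration data; once `w` is near zero, the agent is free to improve beyond what the expert showed.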
Findings
The method is superior to other algorithms in terms of convergence speed and decision stability, avoids training reward values from scratch, and breaks through the restrictions imposed by the demonstration data.
Originality/value
The agent can adapt to dynamic scenes through exploration and trial-and-error mechanisms built on the experience of the demonstration trajectories. The demonstration data set used in IL and the experience samples obtained during RL are coupled to improve data utilization efficiency and the agent's generalization ability.