Supervised learning of mapping from sensor space to chained form for unknown non-holonomic driftless systems

Purpose – This study proposes an offline exploratory method consisting of two stages. First, the authors complete the kinematics model of the system by analyzing the Jacobians in the vicinity of the starting point and deducing a virtual input that effectively navigates the system along the non-holonomic constraint. Second, the authors explore the sensorimotor space in a predetermined pattern and obtain an approximate mapping from sensor space to chained form that facilitates controllability. Design/methodology/approach – In this paper, the authors tackle the controller acquisition problem for non-holonomic driftless systems whose sensorimotor model is unknown. This feature is interesting for simplifying and speeding up the deployment of industrial mobile robots with feedback controllers. Findings – The authors validate the approach on the unicycle test case by controlling the system with a time-state control policy. Simulated and experimental results show the effectiveness of the proposed method, and a comparison with the proximal policy optimization algorithm is presented. Originality/value – This research indicates clearly that feedback control of non-holonomic systems with uncertain kinematics and unknown sensor configuration is possible.


Introduction
For the actual implementation of autonomous navigation in mobile robots, two problems arise in practice. First, the mobile robot structure is generally subject to a non-holonomic constraint. Even though there exist mobile mechanisms, such as omni-directional wheels, that are free from non-holonomic constraints, they are complex and their power efficiency is often low. Among non-holonomic driftless systems (Borisov et al., 2016), the unicycle, which has a single non-holonomic constraint, is a typical example and is the one considered in this research. Second, the structure of the robot system can be partially unknown. Implementation of mobile robots generally requires the definition and management of sensor measurements based on a global coordinate system, which requires the parameters of the sensor configuration. This causes a preparation cost when such parameters are unknown. Regarding mobile robot kinematics, even when we know the structure of the mobile base, parameters such as the wheel radius and wheel base can be unknown. Thus, a learning approach is expected to resolve this problem by covering both unknown sensor settings and partially unknown kinematics.
The stabilization problem of non-holonomic systems has often been tackled. Brockett (1983) showed that, in general, these systems admit no continuous time-invariant stabilizing state feedback law. Rifford (2008) identified two obstructions, one global and one local, to the existence of stabilizing feedbacks. Discontinuous state feedback control laws have been proposed, such as in Astolfi (1995), who suggested applying a single coordinate transformation to the non-holonomic system in chained form. Amar and Mohamed (2013) designed a controller based on kinematic polar coordinate transformations. D'Andrea-Novel et al. (1991) showed that stabilization of three-wheel mobile robots is possible with static state feedback using Lagrangian formalisms and differential geometry. However, all these proposals assume that the robot kinematics, the sensor configuration and the environment are well known.
On the other hand, depending on the application, different sensors or image features may be used. For example, if a Global Positioning System or a ceiling camera is not available, it is not easy to obtain the (x, y, θ) coordinates in Cartesian space. Even in such a case, it should be possible to navigate a robot to a destination by specifying a desired sensor value as the target. This approach widens the applicability of mobile robots with less calibration effort. However, the model of the control law in such a robot must still agree with the configuration of the actuators, the sensors and the environment at all times; otherwise, the controller will not function correctly.
There have been many proposals to adapt the control law to the problem setting automatically (Kolmanovsky and McClamroch, 1995). For example, Graefe and Maryniak (1998) built a map from sensor-control Jacobians for controlling robot manipulators with calibration-free visual systems, but their approach did not consider robotic systems with non-holonomic constraints. Similarly, Navarro-Alarcon et al. (2019) computed adaptive navigation systems with unknown sensorimotor models. Kobayashi et al. (2013) approximated the Jacobian by gradient descent on non-overlapping sensor spaces and extrapolated the Jacobian mapping outside the sensing ranges to estimate an integrated sensor space. They demonstrated their method on a 2-degree-of-freedom (DoF) manipulator and on a non-holonomic mobile robot traveling along an infinite wall. Still, they neutralized the non-holonomic constraints of the unicycle by discarding the coordinate parallel to the reference wall. Miller (1987) used a general learning algorithm to learn the relation between control inputs and sensor outputs in a robot arm. Likewise, Kobayashi et al. (2019) proposed estimating the Jacobian matrix for visual servoing with unknown kinematics and other system parameters by approximating the relation between actuators and sensors using a measurement given by mutual information. In contrast to previous works on the stabilization of non-holonomic systems, these studies do suggest solutions for modeling the system, although they are not general enough to account for non-holonomic constraints.
More recently, reinforcement learning algorithms (Smart and Kaelbling, 2002) have tackled the problem of uninterpreted sensors and effectors by achieving controllability of non-holonomic systems with unknown sensorimotor mapping. The acquisition of a controller could be made more sample-efficient by considering the non-holonomicity of the system, rather than relying on a hand-crafted reward design, which is often required by reinforcement learning algorithms. In addition, in the case of a driftless system, sample collection can be made more efficient and even safer by a lattice-shaped pattern of exploratory motion.
In this paper, we address the problem of learning a sensorimotor mapping for a class of non-holonomic driftless systems with unknown kinematics and unknown sensor configuration to combine the applicability of adaptive controllers and non-holonomic controllers. For this purpose, we first formulate the problem, then we present the learning approach, and finally we show simulation and experimental results of the method applied to the unicycle problem in a variety of sensor configurations. The main contribution of this paper is a method to automatically construct such mapping in a systematic way for a predefined region of expected controllability.
The remainder of this paper is organized as follows. Problem definition, notation and essential knowledge are described in Section 2. Theoretical development of the method is described in Section 3. Simulation and experimental results are presented in Section 4 and discussion follows in Section 5.

Problem setting
Let u ∈ R^m be the control (input) vector and s ∈ R^n, n > m, be the sensor (output) vector of a dynamic driftless affine system with state and output equations

q̇ = F(q)u,  s = H(q),    (1)

where q ∈ R^n is the vector of generalized coordinates, q̇ is the vector of generalized velocities and H: R^n → R^n is an isomorphic mapping of class C¹. The transformation H is arbitrary and has no units. The problem tackled in this paper is to find a control law u = w(s) that realizes a desired sensor value s^(d) under the condition of unknown F and H (unknown kinematics and sensor configuration), with arbitrary q, and with non-holonomic constraints compatible with the Pfaffian form, i.e. A(q)q̇ = 0 (Choset et al., 2005). In other words, we can observe q, but only through an uncalibrated sensor measurement. The problem is similar under redundant observations s′ ∈ R^r, r > n, in which case we consider s″ = H″(q) with s″ ∈ R^n and H″ an isomorphic mapping. We assume that the inputs can be driven independently and that the sensor signal is differentiable with respect to the input. The following sections present formal descriptions of the basic concepts.
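The setting above can be sketched in code. The following is our own illustrative stand-in (not the authors' implementation): a plant whose kinematics F and sensor mapping H are hidden, exposing only `apply(u, duration)` and `sense()`; the unicycle is used as the concrete hidden system, and the interface names are our assumptions.

```python
import numpy as np

# Hypothetical sketch of the problem setting: the learner may only apply
# inputs u and read the sensor vector s = H(q); the kinematics F and the
# sensor mapping H (names from the text) are hidden inside the plant.
class HiddenDriftlessPlant:
    def __init__(self, H, dt=0.01):
        self.q = np.zeros(3)  # generalized coordinates (x, y, theta); never exposed
        self.H = H            # unknown isomorphic sensor mapping
        self.dt = dt

    def _F(self, q):
        # driftless affine kinematics: q_dot = F(q) u (unicycle as example)
        x, y, th = q
        return np.array([[np.cos(th), 0.0],
                         [np.sin(th), 0.0],
                         [0.0, 1.0]])

    def apply(self, u, duration):
        # forward-Euler integration of q_dot = F(q) u; the learner sees nothing here
        for _ in range(int(round(duration / self.dt))):
            self.q = self.q + self.dt * (self._F(self.q) @ np.asarray(u))

    def sense(self):
        return self.H(self.q)  # the only observable quantity

plant = HiddenDriftlessPlant(H=lambda q: q.copy())  # identity H, for illustration
plant.apply([1.0, 0.0], 1.0)  # drive "forward" for one second
s = plant.sense()
```

Any controller discussed below interacts with the system only through such an interface.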

Non-holonomic systems
Holonomic systems are those whose constraints obey an equation of the form

f(q₁, …, q_n, t) = 0.    (2)

When the constraint cannot be expressed in the form of equation (2), the system has non-holonomic constraints. Non-holonomic systems pose more difficulties than holonomic systems because the Lagrangian equations cannot be applied directly. A system with first-derivative non-holonomic constraints may be expressed with the equation

f(q₁, …, q_n, q̇₁, …, q̇_n, t) = 0.

Many of these systems are characterized by a smaller number of control inputs than DoFs, as in the unicycle, the rolling wheel and the rolling sphere problems.

Unicycle
In this paper, we rely on the unicycle for specifying the methods and results without loss of generality. The unicycle is a non-holonomic system with n = 3 DoFs and m = 2 control inputs. The state equation is

ẋ = u₁ cos θ,  ẏ = u₁ sin θ,  θ̇ = u₂,

where x, y and θ denote the position and orientation of the unicycle as depicted in Figure 1. The input vector u = [u₁ u₂]ᵀ normally comprises linear and rotational velocities, but in this research, it is not specified which component of u corresponds to each of them. It should be noted that some variations of the unicycle, e.g. a car-like system with limited rotation, add an additional non-holonomic constraint. We do not deal with these constraints in this paper.
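As a quick numerical sanity check (our own, not from the paper), the unicycle kinematics above satisfy the Pfaffian constraint ẋ sin θ − ẏ cos θ = 0 for any input, i.e. the wheel cannot slip sideways no matter how it is driven:

```python
import numpy as np

# Verify the non-holonomic constraint x_dot*sin(theta) - y_dot*cos(theta) = 0
# holds identically along the unicycle kinematics, for random states and inputs.
def unicycle_qdot(q, u):
    x, y, th = q
    return np.array([u[0] * np.cos(th), u[0] * np.sin(th), u[1]])

rng = np.random.default_rng(0)
residuals = []
for _ in range(100):
    q = rng.uniform(-np.pi, np.pi, size=3)
    u = rng.uniform(-1.0, 1.0, size=2)
    xd, yd, thd = unicycle_qdot(q, u)
    residuals.append(abs(xd * np.sin(q[2]) - yd * np.cos(q[2])))
max_residual = max(residuals)  # identically ~0: the constraint always holds
```

This is exactly the "forbidden direction" that the virtual input of Section 3 is designed to overcome.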

Approach
We propose an offline learning algorithm to obtain a mapping from sensor space to chained form (Jiang and Nijmeijer, 1999) as follows. First, we deduce a virtual input component u₃ by Jacobian estimation, composed of a sequence of admissible inputs such that the input vector becomes u_v = [u₁ u₂ u₃]ᵀ, with the aim of overcoming the forbidden direction imposed by the non-holonomic constraint at the initial state. Second, we explore the sensor space following a fixed trajectory, using the virtual input deduced earlier, to obtain a mapping f from sensor space to chained form, z = f(s). Therefore, f is approximated by data collection rather than from a mathematical model, which is unknown. This method is advantageous because it allows skipping the procedures of modeling, calibration and sensor mapping measurement. Finally, we validate the method by controlling the system with a time-state control policy. There are many approaches to deal with controllability of systems in chained form (Luo and Tsiotras, 2000; Murray and Sastry, 1991; Jiang and Nijmeijer, 1999), but here we use time-state control, proposed in Sampei (1994) and Sampei et al. (1996) and later described in more detail with a similar control technique in Lefeber et al. (2000) and Lefeber et al. (2004), because it is relatively simple and easy to implement.

Estimation of sensorimotor mapping
We define two learning stages. The first stage tackles the problem of system controllability. In other words, the method starts by learning the inputs required to explore the sensor space efficiently. Later, the second stage uses the results of the first stage to navigate in the sensor space and gather a data set comprised of sensor samples and the corresponding generalized coordinates in chained form assuming a well-defined trajectory. This data set is used to infer a mapping from sensor space to chained form. In Section 3.3, we describe the method to assess the accuracy of the approach.

Model learning
In the first stage, the controller learns to navigate efficiently through the sensor space. For that purpose, the control system needs to learn how to control the variation of each coordinate of the sensor signal independently. However, there are only two inputs but three coordinates in the sensor signal, thus only two dimensions are immediately controllable from the initial position. Here we show how to calculate a sequence of motions to travel along the forbidden direction while minimizing variations along the rest of the sensor space.
3.1.1 Jacobian. Let s^(p) ∈ R³ be the unitless sensor observation sampled at point p, where p is identified by its sensor value. The sensor-control Jacobian J is defined as a measure of the variability of the sensor signal with respect to the inputs, in matrix form:

J := ∂s/∂u ∈ R^{n×m} = [∂s₁/∂u₁ ∂s₁/∂u₂; ∂s₂/∂u₁ ∂s₂/∂u₂; ∂s₃/∂u₁ ∂s₃/∂u₂]

At the initial state s^(0), the sensor-control Jacobian J^(0) = J(s = s^(0)) of the system indicates the variation of the sensor signal with respect to each input u^(1) and u^(2). The state equation in sensor space at any point p and input u^(#) is ṡ = J^(p)u^(#). Considering the control input u = t₁u^(1) = [t₁ 0]ᵀ and integrating over a short time, s ≈ s^(p) + t₁j₁^(p); similarly for u = t₂u^(2) = [0 t₂]ᵀ and the second column j₂^(p). See Figure 2 for a depiction of equation (9). Hence, the Jacobian is easily obtained by measuring the change in sensor values before and after applying a constant input u^(#) to the system, sequentially for every input. After obtaining each Jacobian column, the system backtracks its movements to return to s^(i).

Figure 1 The unicycle is a non-holonomic system with canonical generalized coordinates (x, y, θ) as shown and non-holonomic constraint ẋ sin θ − ẏ cos θ = 0

In the case of the unicycle, there is one inaccessible dimension, so we need to find a state s^(★) whose Jacobian J^(★) contains an element equal to ±1, or at least as close to ±1 as possible, assuming that all Jacobian elements are normalized (Figure 3). We explore the vicinity of s^(0) in search of s^(★) by applying a candidate input u^(s(p)) for a duration Δt^(s(p)). The Jacobian J^(p) is then obtained at the resulting state s^(p), and the system is taken back to the initial state by applying −u^(s(p)) for the same amount of time. The process is repeated in an exponential search for the Δt^(s★) and u^(s★) that optimize equation (10).

3.1.2 Virtual input. As shown above, the input u^(s★) applied for Δt^(s★) reaches the state s^(★), a state where applying the input u^(c) maximizes movement in the direction forbidden by the non-holonomic constraint at s^(0). If u^(c) applied for Δt is then followed by the backtracking input −u^(s★) for Δt^(s★), the system effectively travels along the forbidden direction at s^(0). Therefore, we have deduced a sequence of inputs whose end result is equivalent to a virtual input u^(3) in the direction of the non-holonomic constraint. We will use the inputs u^(1), u^(2) and the virtual input u^(3) to navigate the sensor space freely in the next stage. Conceptually, the result is similar to a holonomic system with an additional input.
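The two ideas of this stage can be sketched together on a hidden unicycle. This is our own illustration with assumed interfaces (not the authors' code): first, probe each input for a short time and backtrack to estimate the sensor-control Jacobian by finite differences; second, compose a rotate / translate / rotate-back sequence, playing the roles of u^(s★), u^(c) and the backtracking input, into a net sideways displacement that no single admissible input can produce.

```python
import numpy as np

# Hidden unicycle with an identity sensor mapping, for illustration only.
class Plant:
    def __init__(self, dt=0.001):
        self.q = np.zeros(3)  # hidden (x, y, theta)
        self.dt = dt
    def apply(self, u, T):
        for _ in range(int(round(T / self.dt))):
            x, y, th = self.q
            self.q = self.q + self.dt * np.array(
                [u[0] * np.cos(th), u[0] * np.sin(th), u[1]])
    def sense(self):
        return self.q.copy()

def estimate_jacobian(plant, h=0.05):
    # finite-difference probe of J = ds/du, backtracking after each probe
    s0 = plant.sense()
    J = np.zeros((3, 2))
    for k in range(2):
        u = np.zeros(2); u[k] = 1.0
        plant.apply(u, h)                    # probe input k
        J[:, k] = (plant.sense() - s0) / h   # column of the Jacobian
        plant.apply(-u, h)                   # return to the sampling point
    return J

plant = Plant()
J0 = estimate_jacobian(plant)  # at theta = 0: columns ~ [1,0,0] and [0,0,1]

# Virtual input u(3): reach s(*) by rotating 90 degrees, translate, rotate back.
plant.apply([0.0, 1.0], np.pi / 2)   # u(s*): forward motion now points "sideways"
plant.apply([1.0, 0.0], 0.5)         # u(c): travel along the forbidden direction
plant.apply([0.0, -1.0], np.pi / 2)  # backtrack the rotation
q_end = plant.sense()                # ~ (0, 0.5, 0): net lateral displacement
```

The end state shows a pure displacement along y with x and θ (approximately) restored, which is exactly the motion forbidden at s^(0).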

Mapping of sensor space to chained form
The state of a non-holonomic system depends on the history of the control inputs as a result of the non-integrable constraints. Therefore, navigation in the sensor space requires tracking the input history and applying it to the kinematics of the system to obtain a consistent state. However, in this problem, we cannot rely on the kinematics of the system. Here we propose to circumvent the kinematics problem by returning the state to its initial position by backtracking the sequence of inputs applied to reach each sampled state. This method requires that there are no significant deviations in the trajectory of the system when backtracking compared to the outward trajectory.

Chained form
Chained form is a canonical formulation that obeys the formula (shown here for the two-input case)

ż₁ = u₁,  ż₂ = u₂,  ż₃ = z₂u₁,    (14)

i.e. ż = g₁(z)u₁ + g₂(z)u₂ with g₁(z) = [1 0 z₂]ᵀ and g₂(z) = [0 1 0]ᵀ. Now, let f = [f₁ f₂ f₃]ᵀ be the mapping of sensor coordinates to chained form coordinates, denoted by

z^(i) = f(s^(i)),    (16)

where i indicates some state. Knowing from the definition of the Jacobian that ṡ = J^(i)u and differentiating equation (16) with respect to time, assuming that u is constant,

ż = (df/ds)ṡ = (df/ds)J^(i)u.    (18)

From the definition of chained form, equation (14), the following holds:

ż = [g₁(z) g₂(z)]u.    (19)

Equating equation (18) to equation (19) and removing u, we arrive at

G := (df/ds)J^(i) = [g₁(z) g₂(z)],    (20)

where it can be seen that G has two terms: the first one, df/ds, is the Jacobian of the mapping from sensor space to chained form with respect to the sensor observations, and the second one, J^(i), is the same Jacobian as in the first stage, although sampled at different coordinates. In practice, we can sample G directly, bypassing the need to calculate the two terms separately.

Figure 2 The Jacobian is obtained by subtracting the sensor observation at the target state s^(i) from the observations after applying inputs u^(1) and u^(2) for a small amount of time
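For reference, the unicycle admits a textbook analytic chained-form transformation (this is standard background, not the learned mapping f of this paper): z₁ = x, z₂ = tan θ, z₃ = y with transformed inputs v₁ = u₁ cos θ, v₂ = u₂/cos²θ. A quick numerical check that z then obeys equation (14)-style dynamics:

```python
import numpy as np

# Verify numerically that z1 = x, z2 = tan(theta), z3 = y with
# v1 = u1*cos(theta), v2 = u2/cos(theta)**2 satisfies
# z1_dot = v1, z2_dot = v2, z3_dot = z2*v1 (the two-input chained form).
rng = np.random.default_rng(1)
errs = []
for _ in range(100):
    x, y, th = rng.uniform(-1.0, 1.0, size=3)  # keep |theta| < pi/2
    u1, u2 = rng.uniform(-1.0, 1.0, size=2)
    xd, yd, thd = u1 * np.cos(th), u1 * np.sin(th), u2  # unicycle dynamics
    z2 = np.tan(th)
    v1 = u1 * np.cos(th)
    v2 = u2 / np.cos(th) ** 2
    z1d = xd                     # d/dt z1 = d/dt x
    z2d = thd / np.cos(th) ** 2  # d/dt tan(theta)
    z3d = yd                     # d/dt z3 = d/dt y
    errs += [abs(z1d - v1), abs(z2d - v2), abs(z3d - z2 * v1)]
max_err = max(errs)
```

The point of the proposed method is that this transformation is unavailable when F and H are unknown; the mapping f must instead be learned from samples.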

Exploration of sensor space
In this section, we describe the method used to obtain pairs (s^(i), z^(i)) of corresponding coordinates in sensor space and chained form space. The sensor space sampling procedure involves controlling the system with a fixed sequence of inputs to reach each vertex of a grid in virtual space. The coordinates of each vertex are (c₁u^(1)Δt_c, c₂u^(2)Δt_c, c₃u^(3)Δt_c), where c₁, c₂, c₃ ∈ Z are the indices of the grid coordinates and Δt_c is a fixed parameter.
The fixed sequence of inputs must abide by the following rules on account of the previously mentioned limitations:
- Every input must be backtracked in reverse order.
- u^(s★) must always be applied last, to prevent inadvertently traversing along the subspace of the virtual input u^(3).
- The use of the virtual input u^(3) should be minimized to reduce cumulative position errors.
The algorithm used herein is shown in Algorithm 1. The sensor samples s (i) are read from the sensor observations while the virtual states z (i) are calculated internally based on the input history. At each point of the grid, the virtual coordinate z (i) is recorded together with the sensor observation s (i) at that point. The resulting pair is incorporated into the data set for training the approximated mapping.
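The loop above can be sketched as follows. This is our reading of Algorithm 1 under stated assumptions: the plant and virtual-input interfaces are hypothetical, and a toy plant whose sensor responds directly to the inputs stands in for the real system, purely to show the traversal and backtracking order.

```python
import numpy as np

class ToyPlant:
    # toy stand-in whose sensor responds linearly to the inputs (illustration only)
    def __init__(self):
        self.s = np.zeros(3)
    def apply(self, u, T):
        self.s = self.s + T * np.array([u[0], u[1], 0.0])
    def sense(self):
        return self.s.copy()

def apply_virtual(plant, T):
    plant.s[2] += T  # toy u(3): acts on the third sensor coordinate

def explore(plant, dt_c=0.25, n=2):
    dataset = []
    for c1 in range(-n, n + 1):
        for c2 in range(-n, n + 1):
            for c3 in range(-n, n + 1):
                # outward leg: u(1), u(2), then the virtual input u(3) last
                plant.apply([np.sign(c1), 0.0], abs(c1) * dt_c)
                plant.apply([0.0, np.sign(c2)], abs(c2) * dt_c)
                apply_virtual(plant, c3 * dt_c)
                z = np.array([c1, c2, c3]) * dt_c  # internal chained coordinate
                dataset.append((plant.sense(), z))
                # backtrack every input in exactly reverse order
                apply_virtual(plant, -c3 * dt_c)
                plant.apply([0.0, -np.sign(c2)], abs(c2) * dt_c)
                plant.apply([-np.sign(c1), 0.0], abs(c1) * dt_c)
    return dataset

plant = ToyPlant()
data = explore(plant)      # 5**3 = 125 (s, z) pairs for n = 2
back_home = plant.sense()  # the plant returns to the origin after each vertex
```

Backtracking in last-in-first-out order is what lets every vertex be visited from the same initial state without relying on the (unknown) kinematics.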

Function approximation.
The last step toward obtaining f consists of processing the data set obtained earlier using radial basis functions with Gaussian kernels (Gaussian RBF), although other supervised learning techniques such as neural networks should also be valid.
A Gaussian kernel takes the form

φ_j(s) = exp(−‖s − b_j‖² / (2σ²)).

A Gaussian RBF is a linear combination of Gaussian kernels distributed over the target region of the approximation. Each location is set by b_j and denoted a base. Thus, for one output variable,

f_k(s) = Σ_{j=1}^{N_B} w_j φ_j(s),    (22)

where N_B is the total number of kernels and w = [w₁ ⋯ w_{N_B}]ᵀ are the unknown linear coefficients, or weights. In this research, we specified the number of kernels, and the approximation started by distributing the kernels in an orthogonal grid covering all sensor samples in the data set. The weights are calculated by least squares as specified in Kondor (2004). Taking N as the number of points in the data set, the loss function is

L(w) = Σ_{i=1}^{N} (z^(i) − f(s^(i)))² + λ‖w‖²,

and, defining the design matrix Φ with Φ_{ij} = φ_j(s^(i)),    (24)

the solution with regularization term λ is

w = (ΦᵀΦ + λI)⁻¹Φᵀz,

which, replaced in equation (22), finally gives us f.
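The regularized least-squares fit can be written in a few lines. This is a minimal sketch of the closed form above (the 1-D sine target is our own toy example, not the paper's data):

```python
import numpy as np

def rbf_design(S, bases, sigma):
    # Phi[i, j] = exp(-||s_i - b_j||^2 / (2*sigma^2))
    d2 = ((S[:, None, :] - bases[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rbf_fit(S, z, bases, sigma, lam=0.5):
    # w = (Phi^T Phi + lambda*I)^(-1) Phi^T z
    Phi = rbf_design(S, bases, sigma)
    A = Phi.T @ Phi + lam * np.eye(len(bases))
    return np.linalg.solve(A, Phi.T @ z)

def rbf_predict(S, bases, sigma, w):
    return rbf_design(S, bases, sigma) @ w

# toy 1-D check: approximate z = sin(s) on [-2, 2] with 9 kernels on a grid
S = np.linspace(-2, 2, 50)[:, None]
bases = np.linspace(-2, 2, 9)[:, None]
w = rbf_fit(S, np.sin(S[:, 0]), bases, sigma=0.75, lam=1e-3)
pred = rbf_predict(S, bases, sigma=0.75, w=w)
err = np.max(np.abs(pred - np.sin(S[:, 0])))
```

In the paper's setting, each of the three chained-form coordinates z₁, z₂, z₃ would get its own weight vector over the same grid of bases in sensor space.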

Assessment
Evaluation of the estimated mapping from sensor space to chained form space is performed by placing the system at an arbitrary point in the sampled region of the sensor space and controlling it to the origin. Here we apply time-state control, although alternative controllers may be equally valid. Non-holonomic systems often have non-linearities that linear controllers cannot deal with; overcoming these difficulties is out of the scope of this research. For simplification, we assume that the starting orientation during assessment is approximately parallel to the starting orientation of the learning stages.

State space control of time-axis form.
The time-state control strategy involves transforming the state equation of a non-holonomic system into two independently controlled state equations (Sampei, 1994): the time control part and the state control part. The transformation involves a change of coordinates z → [τ ξᵀ]ᵀ ∈ R³. The state equation of the time control part consists of a single generalized coordinate τ ∈ R controlled by the input component u₁:

τ̇ = u₁.    (26)

Typically, control of τ is constant, i.e. u₁ = 1. The state control part can then be represented by

dξ/dτ = A ξ + B μ₂.    (27)

Equation (27) sees the time variable replaced with τ, so equation (26) controls the time scale of equation (27). Control to the origin is achieved by alternating positive and negative values of u₁ (moving back and forth along the time axis) while u₂ stabilizes ξ, until ξ = 0, and then τ is driven to 0 with u₁. The advantage of the time-state control form is that, in many cases, the state control part can be designed as if there were no non-holonomic constraints in the state equation. Applying a non-linear transformation from some non-linear system to time-state control form with generalized coordinates (τ, ξ) enables linear feedback control of the system. The time part of a 3-DoF state is

τ̇ = ż₁ = u₁,    (28)

where we have set h(τ, ξ) = 1, and its state control part, with ξ = [z₃ z₂]ᵀ and μ₂ = u₂/u₁, is

d/dτ [z₃; z₂] = [0 1; 0 0][z₃; z₂] + [0; 1]μ₂.    (29)

We now show the control law for assessing controllability of the target system. System (28) is driven by a constant input u₁ = 1, and system (29) is controllable by state-space control with one input u₂ = f(z₂, z₃), which is calculated as follows. The controllability matrix of equation (29) is (Dominguez et al., 2006)

[B AB] = [0 1; 1 0],

which is controllable as its rank is two. Under feedback stabilization u₂ = −Kξ with parameters K = (k₁ k₂), the feedback-controlled matrix A_f becomes

A_f = [0 1; −k₁ −k₂].    (32)

Given control poles p₁ and p₂, the characteristic polynomial is

(λ − p₁)(λ − p₂) = λ² + k₂λ + k₁,    (33)

so k₁ = p₁p₂ and k₂ = −(p₁ + p₂). Thus, the control input with poles p₁ and p₂ is

u₂ = −p₁p₂ z₃ + (p₁ + p₂) z₂.    (34)

Consequently, by controlling u₂ with equation (34), we stabilize the system close to the time axis indicated by equation (28).
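The pole-placement step can be checked numerically. This is our reconstruction of the garbled closed form (the double-integrator state part with gains k₁ = p₁p₂, k₂ = −(p₁ + p₂)); the pole values are example choices, assumed stable, not the paper's experiment:

```python
import numpy as np

# State part d/dtau [z3, z2] = A x + B u2 with A = [[0,1],[0,0]], B = [0,1]^T.
# Place both poles at p1, p2 and verify the closed loop converges.
p1, p2 = -5.0, -5.0                  # example stable poles (assumption)
k1, k2 = p1 * p2, -(p1 + p2)         # gains from the characteristic polynomial
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
K = np.array([[k1, k2]])
Af = A - B @ K                       # closed-loop matrix [[0,1],[-k1,-k2]]
eig = np.linalg.eigvals(Af)          # should be p1 and p2

# simulate xi(tau) from z3 = 1, z2 = 0 and observe convergence toward 0
x = np.array([1.0, 0.0])
dtau = 0.001
for _ in range(5000):                # 5 units of tau
    x = x + dtau * (Af @ x)
```

With both poles in the open left half plane the state part converges, which is what drives the controlled trajectories toward the time axis in the assessment below.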

Implementation and results
We tested and validated our approach under simulated conditions and experimentally on a real robot.

Simulation
We used the canonical state equation for the unicycle,

q̇ = [u₁ cos θ, u₁ sin θ, u₂]ᵀ,

with u₁ indicating linear speed and u₂ rotational speed, together with the output equation s = H(q). These two equations are hidden from the learning and control algorithm: only s may be sampled and only u may be modified arbitrarily. Here, we show the results for three variations of H, designed so that the mapping is isomorphic in the region of interest; among them, H₁(q) = q (the identity) and

H₃(q) = [x + e^y, e^{x−y}, θ³]ᵀ.

In all three cases, the parameters of the simulation were set as follows: the initial state was q₀ = [0 0 0]ᵀ; 9 samples per axis with a separation of 0.25 units in the first stage; 5 samples per axis (a total of 5³ samples) in the range [−2, 2] for constructing the data set, approximated by a Gaussian RBF with 5³ kernels, standard deviation 1.5 multiplied by the minimum distance between kernels and regularization term λ = 0.5 in the second stage. The linear controller for assessment had poles (−5, −5), starting position (x, y, θ) = (−2, 0.5, π/4) and a running time of 2.5 s. The standard deviations of the Gaussian kernels for each case were σ₁ = 1.1970, σ₂ = 0.6643 and σ₃ = 3.0745, respectively. The trajectory, observations and Gaussian kernel locations in sensor space during the sensor space mapping stage are shown in Figures 4 and 5.
In the three sensor configurations, the system was successfully controlled to the time axis. We did not add back-and-forth control to u₁ to control the whole system to the origin because it was irrelevant for the purposes of this paper. Errors in the z₂-axis of the transformation function f, corresponding to y in (x, y, θ) space, along the time axis in chained space were negligible (f₂(τ) = 0 ± 10⁻¹³ for the three cases). In contrast, errors in the z₃-axis, corresponding to θ in (x, y, θ) space, were significant: f₃(τ) = 0 ± 0.0138 for H₁, f₃(τ) = 0 ± 0.0759 for H₂ and f₃(τ) = 0 ± 0.2356 for H₃. The inaccuracies in the approximation of sensor observations to chained space are perceived as perturbations by the control law and are appropriately corrected. Indeed, these inaccuracies resulted in small deviations of the controlled trajectory, as shown in Figure 6. Positional errors derived from inaccuracies in the actuators were negligible, as expected in a simulated environment. In the case of H₁, the sensor mapping is the identity, so the trajectory in sensor space matches the trajectory in (x, y, θ) space; the sampled observations are evenly distributed across the sensor space and the approximation of f is good. With respect to H₂, the Gaussian kernels cannot accurately approximate all the sampled observations in the region s₂ ∈ (0, 1) (Figure 5, H₂), which results in a slight deviation of the trajectory, as shown at the left of Figure 6, H₂. In the case of H₃, the oscillation in the trajectory cannot be explained by the control poles because they are real values. Rather, the inaccuracies are better explained by the high concentration of sampled observations relative to the number of kernels near the initial state, as shown in Figure 5, H₃.

Figure 4 The trajectory of the simulated system during sensor space sampling (stage 2) in (x, y, θ) coordinates is the same for H₁, H₂ and H₃. Note: Arrows indicate sampled states

Mobile robot
We tested our approach experimentally on a real robot (Figure 7). We used the mobile robot model Pioneer 3-DX [1], which features two feedback-controlled wheels with high-resolution encoders and a swivel caster for balance. We used a 5K PTZ camera fixed on the ceiling and connected to an image processing workstation. The camera images were processed with the OpenCV and Armadillo libraries, involving image segmentation by color, noise removal and identification of beacon characteristics. The beacons were installed on the robot as shown in Figure 8. The sensor outputs were the (x, y) pixel coordinates of the centroid of one beacon and the angle between the line connecting both beacons and y = 0. The CORBA [2] implementation by omniORB [3] was used to connect all off-board and on-board components. The inputs to the robot were linear and rotational speed, as in the simulation. The parameters of the sampling controllers were similar to the simulation but with a reduced number of samples: six samples per axis with a separation of 0.3 units in the first stage, and four samples per axis (a total of 4³ samples) separated by 0.667 units for constructing the data set, approximated by a Gaussian RBF with 4³ kernels, a standard deviation of 0.45 and a regularization term λ. The mobile robot was successfully controlled after exploration of the sensor space (Figure 9). The output of the first stage was u^(★) = u^(2) with Δt^(★) = 1.5 s. Figure 10 shows four trajectories starting at different points at the left and bottom sides of the figure converging toward the approximate position of the time axis. As in the simulations, we did not add back-and-forth control to the linear velocity input. The imperfections in the sampled observations in the second stage may be attributed to perspective deformation, lens aberrations, signal noise, image processing lag, partial occlusion of the beacons and cumulative positional errors.

Comparison to proximal policy optimization
We compared the proposed approach to proximal policy optimization (PPO), a class of reinforcement learning algorithm (Schulman et al., 2017), in a problem setting similar to the proposed one. The desired sensor value s^(d) = H([0 0 0]ᵀ) was implicitly defined in the reward function. The unknown output equation was the same as in the first simulated environment, H(q) = H₁(q) = q, with discount factor γ = 0.997, sample time T_s = 0.1 s and initial state for each training episode (x₀, y₀, θ₀) = (−2, 0.5, π/4) + r/10, where r is a vector of standard normally distributed random values. The linear speed control input of the simulated system was fixed at 1, while the rotational speed control input was controlled by the PPO algorithm. This way, the control inputs used in the assessment of the proposed method, which relies on time-state control, and in the PPO controller matched more closely. After 138 episodes, training was stopped with an average reward of 5,358 units over the past 5 agents (Figure 11) and a total number of sensor observations of 2,926. The agents from episode 134 to episode 138 were selected to control a unicycle from (x₀, y₀, θ₀) = (−2, 0.5, π/4). Figure 12 shows the trajectory of the agent at episode 136. The average closest distance to the origin achieved by the last five PPO agents was μ_PPO = −0.0783 m (against μ_H1 = −0.0114 m with our method), with standard deviation σ_PPO = 0.0878. Compared to the proposed method, PPO required more samples to arrive at a controller (2,926 against 5³ = 125 in the proposed method), and yet PPO was only trained for controlling the system from (x₀, y₀, θ₀) = (−2, 0.5, π/4). Moreover, PPO required that the system be repositioned at the starting state at the beginning of each episode, while the proposed method only requires being placed at the desired state once and is controllable in the whole region of sensor exploration. The proposed method was shown to be safer and more efficient in the sense that it can avoid unexpected exploration in the process of sample collection.

Figure 8 The experimental setup as seen by the camera and the image processing output

Figure 9 Sampled points in camera coordinates for the data set. Notes: Note that the camera y-axis is inverted. The angle of the arrows indicates the value of s₃ in radians. The empty-filled arrow indicates the coordinates of the initial position for stages 1 and 2. The crosses indicate the locations of the Gaussian kernel centers.

Figure 10 The feedback-controlled trajectories in sensor space starting from four different points show convergence at the time axis

Conclusion
In this paper, we have proposed a method to learn the sensorimotor mapping of an unknown non-holonomic driftless system and unknown sensor configuration with the purpose of system controllability in a predefined target region. The proposed method consists of two stages. First, we explored the vicinity of the system at the initial state to maximize maneuverability of the system with respect to the sensor signal. Second, we explored the sensor space to construct a mapping from sensor space to chained form. We carried out some simulations and real experiments to show that the trained controller is capable of controlling the system after exploration of the sensor space, therefore validating the method. The results show that the accuracy of the approximation of the mapping from sensor space to chained form and the repeatability of the movements of the robot play a significant role in the performance of the method. Finally, the results were compared against the PPO algorithm, showing that the proposed method requires fewer observations and is safer to deploy in the target environment.
The most important limitations that we have identified are, first, that the controllability region is bounded to the sampled region of the sensor space, although this limitation is not specific to our method but applies to function approximation by radial basis functions in general. Second, learning is performed offline because the generality of the problem requirements (i.e. the kinematics of the system are unknown) prevents us from relying on assumptions that would enable online learning. Third, the sensor space sampling stage is affected by the curse of dimensionality; hence, this method is not suitable for systems whose state space has a high number of dimensions.
We expect to make further improvements in the future by dropping some of the assumptions. Specifically, this method does not support non-holonomic systems in which the Jacobian column j₁^(0) is not orthogonal to j₂^(0), such as a unicycle with independently controlled wheels. To overcome this problem, an additional stage prior to Jacobian learning should search for the combination of inputs that maximizes the orthogonality between j₁^(0) and j₂^(0). Furthermore, it seems reasonable to remove backtracking by controlling the system to the origin using linear control, but it is not yet clear under which conditions this is possible. More research is needed in these areas to increase the scope of applicability.
Notes
1 https://cyberbotics.com/doc/guide/pioneer-3dx

Figure 11 Reward vs episode index during PPO training

Figure 12 Trajectory of a trained PPO agent (solid line) and the proposed method (dotted line) for the sensor configuration H₁