 Open Access
 Total Downloads : 0
 Authors : R. Ajay Balaji, N. Mohamed Inzamam Kutty
 Paper ID : IJERTCONV6IS14083
 Volume & Issue : Confcall – 2018 (Volume 06 – Issue 14)
 Published (First Online): 05012019
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Artificial Intelligence used in Robotics
Artificial Intelligence used in Robotics
R. Ajay Balaji, N. Mohamed Inzamam Kutty
Department of Mechanical Engineering Parisutham Institute of Technology and Science
Abstract: Novel anthropomorphic robotic systems increasingly employ variable impedance actuation with a view to achieving robustness against uncertainty, superior agility and improved efficiency that are hallmarks of biological systems. Control ling and modulating impedance profiles such that they are optimally tuned to the controlled plant is crucial in realizing these benefits. In this work, we propose a methodology to generate optimal control commands for variable impedance actuators under a prescribed tradeoff of task accuracy and energy cost. We employ a supervised learning paradigm to acquire both the plant dynamics and its stochastic properties. This enables us to prescribe an optimal impedance and command profile (i) tuned to the hardto model plant noise characteristics and (ii) adaptable to systematic changes. To evaluate the scalability of our framework to real hardware, we designed and built a novel antagonistic series elastic actuator (SEA) characterized by a simple mechanical architecture and we ran several evaluations on a variety of reach and hold tasks. These results highlight, for the first time on real hardware, how impedance modulation profiles tuned to the plant dynamics emerge from the first principles of stochastic optimization, achieving clear performance gains over classical methods that ignore or are incapable of incorporating stochastic information.
Keywords: Antagonistic actuator, dynamics learning, equilibrium point control, impedance control, stochastic optimal control
1. INTRODUCTION
Humans have remarkable abilities in controlling their limbs in a fashion that outperforms most artificial systems in terms of versatility, compliance and energy efficiency. The fact that biological motor systems suffer from significant noise, sensory delays and other sources of stochasticity (Faisal et al. 2008) makes their performance even more impressive. Therefore, it comes as no surprise that biologi cal motor control is often used as a benchmark for robotic systems. On the one hand, biological motor control charac teristics are a result of the inherent biophysical properties of human limbs, and on the other hand, they are achieved through a framework of learning and adaptation processes (Wolpert et al. 1995; Kawato 1999; Davidson and Wolpert 2005). These concepts can be transferred to robotic systems by (i) developing appropriate anthropomorphic hardware and (ii) by employing learning mechanisms that support motor control in the presence of noise and perturbations (Mitrovic et al. 2008).
In this paper, we focus on issues related to adaptive motor control of antagonistically actuated robots. Antagonistic actuator designs are based on the biological principle of
opposing muscle pairs. Therefore, the joint torque motors, for example, of a robotic arm are replaced by opposing actuators, typically using mechanical springs (Pratt and Williamson 1995). Such series elastic actuators (SEA) have had increasing attention in the last few decades (Van derborght et al. 2009) as they provide several beneficial properties over classic joint torque actuated systems:

Impedance control and variable compliance: Through the use of antagonistic actuation, the system is able to vary cocontraction levels, which in turn change the sys tems mechanical properties: this is commonly referred to as impedance control (Hogan 1984). Impedance in a mechanical system is defined as a measure of force response to a motion exerted on the system and is composed of components such as inertia, damping and stiffness. In general SEAs can only vary stiffness of a system and achieving variable damping is technically challenging (e.g. Laffranchi et al. 2010). Consequently, in this paper, when we refer to impedance control, we will solely address a change in stiffness and ignore vari able damping or variable inertia. Antagonistic actuation introduces an additional degree of freedom in the limb dynamics, i.e. the same joint torque can be achieved by different muscle activations. This means low co contraction leads to low joint impedance whereas high cocontraction increases joint impedance. This degree of freedom can be used beneficially in many motion tasks, especially those involving manipulation or inter action with tools. It has been shown through many neu rophysiological studies (e.g. Burdet et al. 2001) that humans are capable of modulating this impedance in an optimal way with respect to the task requirements, trading off selectively against energy consumption. For example, when you use a drill to drill holes in a wall, you learn how to co contract your muscles such that the random perturbations of the drilling have a mini mal impact. Furthermore, the ability to vary compliance plays a crucial role in robot safety (Zinn et al. 2004). In general, impedance modulation is an efficient way to control systems that suffer from noise, disturbances or sensorimotor delays.

Energy efficiency and energy storage: By appropriately controlling the SEA, one can take into account the passive
properties of the springs and produce control strategies with low energy requirements. A wellknown example is walking, where the spring properties com bined with an ideal actuation timing can be used to produce energetically efficient gaits (Collins and Ruina 2005; Collins and Kuo 2010). Furthermore, an SEA has impressive energy storage and fast discharge capabili ties, enabling explosive behavior such as throwing a ball (Wolf and Hirzinger 2008), which is quite hard to achieve with regular joint torque controllers. Therefore, series elasticity can amplify power and work output of an actuator, which is important in the fabrication of lightweight but powerful robotic or prosthetic devices (Paluska and Herr 2006).
A disadvantage of antagonistic actuation is that it imposes higher demands on the redundancy resolution capabilities of a motor controller. Optimality principles have successfully been used in biological (Flash and Hogan 1985; Scott 2004; Todorov 2004) and in artificial systems (Nakamura and Hanafusa 1987; Cortes et al. 2001) as a principled strategy to resolve redundancies in a way that is beneficial for the task at hand. More specifically, stochas tic optimal control (SOC) (Stengel 1994; Bertsekas 1995; Todorov 2006) appears to be an especially appealing the ory as it studies optimality principles under the premise of noisy and uncertain dynamics. Another important aspect when studying stochastic systems is how the information,
for example, about noise or uncertainty is obtained with out prior knowledge. Supervised learning methods can pro vide a viable solution to this problem as they can be used to extract information from the plants sensorimotor data directly.
Here, we propose a control strategy for antagonistic sys tems that is based on stochastic optimal control theory under the premise of minimal energy cost. We propose to extend SOC by learning the dynamic model of the plant, which enables us (i) to adapt to systematic changes of the plant and (ii) extract its stochastic properties. Stochastic properties or stochastic information refers to noise or ran dom perturbations of the controlled system that cannot be modeled deterministically. By incorporating this stochas tic information into the optimization process, we show how impedance modulation and cocontraction behavior emerges as an optimal control srategy from first principles. In the next section, we present a new antagonistic actua tor, which serves as our implementation testbed for study ing impedance control in the presence of stochasticity and which, compared to previous antagonistic designs, has a much simpler mechanical design. In Section 3, we introduce the basic concepts of optimal control and propose an exten sion that uses a learned dynamic model. This supervised learning methodology allows us to adapt online to changes in the dynamics as well as to extract localized stochastic information from movement data. We then propose a sys tematic methodology for incorporating deterministic and stochastic plant dynamic information into the optimal con
trol framework, resulting in a scheme that improves per formance significantly by exploiting the antagonistic redun dancy of our plant. Our claims are supported by a number of experimental evaluations on real hardware in Section 4. We conclude the paper with a discussion and an outlook.
2. A Novel Antagonistic Actuator Design for Impedance Control
To study impedance control, we developed an antagonistic joint with a simple mechanical setup. Our design is based on the SEA approach in which the driven joint is connected via spring(s) to a stiff actuator (e.g. a servomotor). A vari ety of SEA designs have been proposed (for a recent review see Vanderborght et al. (2009)), which we here classify into pseudoantagonistic and antagonistic setups. Pseudo antagonistic SEAs have one or multiple elastic elements, which are connected between the driving motor and the driven joint. The spring tension and therefore, the joint stiffness, is regulated using a mechanism equipped with a second actuator. Antagonistic SEAs have one motor per opposing spring and the stiffness is controlled through a combination of both motor commands. Therefore, in antag onistic designs, the relationship between motor commands and stiffness must be resolved by the controller. This addi tional computational cost is the tradeoff for a biologically plausible architecture.
Fig. 1. Schematic of the variable stiffness actuator. The robots dimensions are: a = 15 mm, L = 26 mm, d = 81 mm, h =
27 mm. The spring rest length is s0 = 27 mm.
For antagonistic SEAs, nonlinearity of the springs is essential to obtain variable compliance (van Ham et al. 2009). Because forces produced through springs with lin ear tensiontoforce characteristics tend to cancel out in an antagonistic setup, an increase in the tension of both springs (i.e. cocontraction) does not change the stiffness of the system. Commercially available springs usually have lin ear tensiontoforce characteristics and consequently most antagonistic SEAs require relatively complex mechanical structures to achieve a nonlinear tension force curve
(Hurst et al. 2004; Migliore et al. 2005; Tonietti et al. 2005). These mechanisms typically increase construction and maintenance effort but also can complicate the sys tem identification and controllability, for example, due to
added drag and friction properties. We directly addressed
Note that K depends linearly on the spring stiffness , but the geometry of the arm induces a nonlinear dependency on and . Figure 2 shows the c omputed profiles of the equilibrium position and stiff ness, respectiv ely.
Further, denoting the arms inertia around the zaxis by Iz
this aspect in our design of the SEA, which aims to achieve
and a damping torque given by ( ) D, the
namic
variable stiffness characteristics using a simple mechanical
setup.
Variable Stiffness with Linear Springs
Here we propose an SEA design that does not rely on com plex mechanisms to achieve variable stiffness but achieves the desired properties through a specific geometric arrange ment of the springs. While the emphasis of this paper is not on the mechanical design of actuators, we will explain the essential dynamic properties of our testbed. Figure 1 shows a sketch of the robot, which is mounted horizontally and consists of a single joint and two antagonistic servo motors that are connected to the joint via linear springs. The springs are mounted with a moment arm offset a at the joints and an offset of L at the motors. Therefore, the springendpoints move along circular paths at the joints and at the motors. Under the assumption that the servomotors are infinitely stiff, we can calculate the torque acting on the arm as follows. Let s1 denote the vector from point C to
A, and s2 the vector from D to B, and s1 and s2 their respec tive lengths. Putting the origin of the coordinate system at the arm joint, we have
= dy
equation can be analytically written as
IzÂ¨= (, ,)D. (5)
Actuator Hardware
Figure 3 depicts our prototype SEA hardware implemen tation of the discussed design. For actuation, we employ two servomotors (Hitec HSR5990TG), each of which is connected to the arm via a spring mounted on two low friction ball bearings. To avoid excessive oscillations, the joint is attached to a rotary viscous damper. The servos are controlled using 50 Hz PWM signals by an Arduino Duemi lanove microcontroller board (Atmel ATmega328). That board also measures the arms joint angle with a contact free rotary position encoder (Melexis MLX90316GO),
as well as its angular acceleration Â¨ using a LilyPad
accelerometer (Analog Devices ADXL330). Finally, we also measure the servomotor positions by feeding a signal
h L sin
a cos
s1 = d + L cos a sin ,
0 0
h + L sin
z Â¸ s
=a1
a cos
s2 = d + L0 cos a s0in . (1)
z Â¸ s
=a2
Denoting the spring constant by and the rest length by s0, this yields forces
s
F1 = ( s1 s0 ) s1
1
s
and F2 = ( s2 s0) s2
2
. (2)
Given the motor positions and and the arm position , the torque generated by the springs is
( , , ) = zT ( F1 Ã— a1 + F2 Ã— a2 ) , (3)
where zT denotes the threedimensional basis vector ( 0, 0, 1)T. To calculate the equilibrium position eq for given
motor positions and , we need to solve ( , , e=q) 0, which in practice is by numerical optimization. At this
position, we can calculate the joint stiffness as
K( , )= (, , ).
=eq
. (4)
180
150
u1 [deg]
120
90
60
30
0
Equilibrium Position
0 30 60 90 120 150 180
u2 [deg]
50
0
50
180
150
u1 [deg]
120
90
60
30
0
Joint Stiffness
0 30 60 90 120 150 180
u2 [deg]
0.5
0.45
0.4
0.35
0.3
Fig. 2. Left: Equilibrium position as a function of the motor positions (in degrees), with contour lines spaced at 5 intervals. Right: Stiffness profile of the arm, as calculated from Equation (4). The maximum achievable stiffness is 150% of the intrinsic spring stiffness.
Motor 1 [deg]
150
100
50
Commanded
From sensor modelled (FIR filter)
0
0 2 4 6 8 10 12 14 16 18 20
Time [s]
150
Motor 2 [deg]
100
50
0
0 2 4 6 8 10 12 14 16 18 20
Time [s]
Joint accelerations [deg/s2]
10000
5000
0
Measured Predicted
Fig. 3. Photograph of our antagonistic robot. Inset panel (a): Sep arate servomotor mounted at the end of the arm to create stochastic
5000
0 2 4 6 8 10 12 14 16 18 20
Time [s]
perturbations (see Section 4.2).
from their internal potentiometer to the AD converters of the Arduino. While the operating frequency is limited to 50 Hz due to the PWM control, all measurements are taken at
a 4Ã—higher frequency and averaged on the board to reduce the amount of noise, before sendin the results to a PC via
a serial connection (RS232/USB).
System Identification
Fig. 4. Comparison of prediction of performance of estimated
motor dynamics (top and middle) and of arm dynamics (bottom) for an independent test data set.
use are very accurate, they need some time to reach the desired position, and therefore we model the true motor positions (, ) as a lowpass filtered version of the commands ( u1, u2) using a finite impulse response (FIR) filter, i.e.
Apart from measuring the exact dimensions (L = 2.6 cm,
K
.
[n] = ( h u ) [n] + [n] = h[k]u [n k] + [n]a =1.5 cm, h 2=.7 cm, d 8=.1 cm) of the robot, and the stiffness constant of the spring (= 424 N m1), system
1 1
k=0
(6)
identification consists of a series of steps, each of which
involves a leastsquares fit between known and actually measured quantities.

Identify servomotor dynamics: The servomotors are controlled by sending the desired position (encoded as a PWM signal), which we refer to as u1 and u2 for motors 1 and 2, respectively. Even though the servomotors we
and similarly for and u2. The term [t] denotes a
noise component of the true motor position that can not be modeled with the FIR filter. By using the inter nal potentiometer of the servomotors, we can measure the actual motor positions to identify the filter coeffi cients hi using a least squares fit, that is, by minimizing
1 i
. t ( [n] ( h u ) [n])2 with respect to h . Weretrieved a good fi t motor dynamics (cf. Figure 4) using an
t of he
FIR filter with seven steps, with estimated coefficients
h = [0, 0, 0, 0.0445, 0.2708, 0.3189, 0.3658].

Calibrate position sensor: Tests with the position sensor
revealed linear position characteristics. By moving the arm physically to several predefined and geometrically measured positions, we determined the sensors offset and slope.

Calibrate acceleration sensor: We matched the accel erations measured with the accelerometer with accel erations derived from the position sensor (using finite differences).

Collect training data and fit parameters: We carried out motor babbling (any excitation movements are applica ble) on the servos and measured the resulting arm posi tions, velocities and accelerations. Taking into account the estimated motor dynamics using the fitted filter, we
solving a twopoint boundary difference/differential equa tion derived by applying Pontryagins minimum principle (Stengel 1994). In practice, in the presence of small pertur bations or modeling errors, the optimal openloop sequence of commands can be run on the real plant together with a simple PD controller that corrects deviations from the planned trajectory. However, those corrections will usually not adhere to the optimality criterion, and the resulting cost J will behigher.
Alternatively, we can try to incorporate stochasticity, e.g. as a dynamic model
dx = f( x, u) dt + F( x, u) d , N ( 0, I) (8) directly into the optimization process and minimize the expected cost.1 Here, d is a Gaussian noise process and
F( Â·) tells us how strongly the noise affects each part of
estimated the armsinertia (Iz = kgm2
rad(2))andvis
the state and control space. A wellstudied example of this
cous damping (D =
N msrad(1)) coefficient using
case is the LQG problem, which stands for linear dynamics
least squares from Equation (5).
On a large independent test set of Stes=t 300, 000 data points, the motor prediction produces a Normalized Mean
Squared Error (NMSE) of enmse = 1.85%. Figure 4 shows an example prediction of performance for a sequence of
random motor commands (20 s from the test set Stest) using
(f( x, u=) Ax B+u), quadratic cost (in both x and u), and
Gaussian noise (F is constant). A solution to this class of
problems is the optimal feedback controller (OFC), that is,
a policy u = ( x, t) that calculates the optimal command
u based on feedback x from the real plant. In the LQG
case, the solution is a linear feedback law u=L( t) x with
2
the estimated dynamic model.
precomputed timedependent gain matrices
1994).
L( t) (Stengel

Stochastic Optimal Control
In many control scenarios it is desirable to be able to perform in the best way possible. For example, one may wish to move the system to a desired posture and consume as little energy as possible during the movement. This type of problem is studied in optimal control theory, the central ingredient of which is the minimization of an optimality criterion
T
+
Solving OFC problems for more complex systems (non linear dynamics, nonquadratic cost, varying noise levels
F) is a difficult computational task. A general way to solve OFC problems for nonlinear quadratic problems is Dynamic Programming (DP) (Bellman 1957). DP in its basic form relies on a discretization of the state and action space, which in practice is difficult to obtain: On the one hand, tiling the stateaction space too sparse will lead to poor representation of the underlying plant dynamics. On the other hand, a very fine discretization leads to a com binatorial explosion of the problem, which is commonly
J [u] =
J [u] =
0 c( x( t), u( t), t) dt h( x( T )) or
c( x( t), u( t), t) dt, (7)
0
referred to as the curse of dimensionality. For example, con sider a discretization of 100 steps for each variable of the
state and action space. In the case of the presented SEA,
this corresponds to a state space dimensionality =n 2, for positions and velocities, and action space dimensionality
for a task with a finite or infinite horizon. Apart from the
optional final cost h(Â·), the criterion integrates a cost rate
c( x, u) over the course of the movement. That cost may
depend on the systems state x, control commands u and on time t, where the initial state of the system is given as x( 0), and x( t) evolves depending on the commands u( t). In the context of biological motor control, this theory has been studied for decades with the wellknown examples of various optimality criteria such as minimum time (Bobrow et al. 1985), energy (Li and Todorov 2007), jerk (Flash and Hogan 1985) and torque change (Uno et al. 1989).
For a system with deterministic (and accurately modeled)
dynamics x =f( x, u), it is sufficient to find the openloop
sequence of commands u( t) and the associated trajectory
x( t) that minimizes J, which can usually be obtained by
m = 2, for the two motors.3 Even for this lowdimensional system the possible combinations of states and actions that
DP needs to evaluate and store in order to find the optimal
control law are p= 1004 =100,000,000. One way to avoid the curse of dimensionality is to restrict the state space to
a region that is close to a nominal optimal trajectory. In the neighborhood of such trajectories the DP problem can be approximated analytically using Taylor expansions of the dynamics and the cost function. The idea is to compute an optimal trajectory together with a locally valid feedback law and then iteratively improve this nominal solution until convergence. Wellknown examples of such iterative meth ods are Differential Dynamic Programming (DDP) (Dyer and McReynolds 1970; Jacobson and Mayne 1970) or the more recent iterative Linear Quadratic Gaussian (ILQG)
(Todorov and Li 2005), which will serve as solution tech nique of choice in this paper. ILQG yields both an open loop sequence of commands and optimized feedback gain matrices, but these are not guaranteed to converge towards the global optimum: Depending on the initial guess of the trajectory, the iterative improvement might result in a solution with only a locally optimal expected cos J .
Modeling Dynamics and Noise through Learning
Analytical dynamic formulations, as described in Section
2.1 or in Equation (5), have the tremendous advantage of being compact and fast to evaluate numerically, but they also suffer from drawbacks. First, their accuracy is limited to the level of detail put into the physical model. For exam ple, our model is based on the assumption that the robot is completely symmetric, that both motors are perfectly cali brated, and that the two springs are identical, but in reality we cannot avoid small errors in all of these. Second, the
Learning this mapping from data, we can directly account for asymmetries. More interestingly, when we collect data from the perturbed system, we can acquire a model of the arms kinematic variability as a function of the motor positions.
We use this learned model f in two ways: first, in (slow)
position control tasks (Section 3.2), and in conjunction with full analytic dynamic models for dynamic reaching tasks (Section 3.3).
Energy Optimal (Equilibrium) Position Control
Consider the task of holding the arm at a certain position , while consuming as little energy as possible. Let us fur ther assume that we have no feedback from the system,5
but that the arm is perturbed randomly. We can state this mathematically as the minimization of a cost
analytical model does not provide obvious ways to model changes in the dynamics, such as from wear and tear, or
J =
wp
. ( f ( u) )2 + u 
2., (10)
more systematic changes due to the weight of an added tool. While these problems can to some extent be alleviated by a more involved and repeated system identification pro cess, the situation is more difficult if we consider the noise
modelÂ·F( ), or stochastic changes to the dynamics. For example, an arm might be randomly perturbed by tool inter
where wp is a factor that weights the importance of being at the right position against the energy consumption, which
for simplicity we model byu 2. Taking into account that the motor commands u are deterministic, and decomposing
the expected position error into an error of the mean plus the variance, we can write the expected cost J as
actions such as when drilling into a wall, with stronger effects for certain postures, and milder effects for others.
J = wp .
2 p..

2. +  u 2 ,
)
It is not obvious how one can model state dependent noise
(f ( u) ) + w
f ( u) (f ( u)
(11)
analytically.
We therefore propose to include a supervised learning component and to acquire both the dynamics and the noise model in a datadriven fashion (Figure 5). Our method of choice in this paper is Locally Weighted Projection Regres sion (LWPR) or (Vijayakumar et al. 2005), because that algorithm allows us to adapt the models incrementally and
online, and it is able to reflect heteroscedastic4 noise in the
which based on the LWPR learned model becomes
J = wp( f(u) )2 +wp2( u)+ u  2. (12)
Here f( u) and ( u) denote the prediction and the one standarddeviationbased confidence interval of the LWPR
model of f ( u). The constant wp represents the importance
of the accuracy requirements in our task. We then can easily minimize J with respect to u =( u , u )T numerically, taking
training data through localized confidence intervals around 1 2 6
its predictions. More details on learning with LWPR can be found in Appendix A.
In order to simplify the presentation as much as possi ble, and also due to technical challenges of operating on the real hardware (for details see Section 5), in this work we learn the stochastic mapping f ( u) from motor positions to joint angle , not taking into account velocities and accel erations. During stationary conditions and in the absence of perturbations, this mapping reflects the equilibrium posi tion of the arm (Figure 2, left). In correspondence to the
general dynamic equation (8), here the state =x eq repre sents the current equilibrium position, u the applied motor
action, and dx the resulting change in equilibrium position. Therefore the reduced dynamics used here, only depends on the control signals, i.e.
dx = f ( u) dt + F( u) d , N ( 0, 1). (9)
into account the box constraints 0 ui 180 .
Dynamic Control with Learned Stochastic Information
Equilibrium position control is ignorant about the dynam ics of the arm, that is, going from one desired position to the next might induce swinging movements, which are not damped out actively. Proper dynamic control should take these effect into account and optimize the command sequence accordingly. What follows is a description of how we model the full dynamics of the arm, that is, the combination of the dynamics of the joint and the motors.
The state vector x[k] of our system at time k consists of the joint angle x1[k] = [k] and joint velocity x2[k] = [k] as well as 12 additional state variables, which represent the
command history of the two motors, i.e. the last six motor
Cost function
Stochastic Optimal Controller
Execute control law Plant
accelerations
Dynamics + Noise Dynamics data
Dynamics learning
positions, velocities & motor commands
Fig. 5. Schematic diagram of our proposed combination of stochastic optimal control (SOC) and learning. The dynamic model used in SOC is acquired and constantly updated with data from the plant. The learning algorithm extracts the dynamics as well as stochastic information contained (noise model from confidence intervals). SOC takes into account both measures in the optimization.
commands that were applied to the system. The state vector, therefore, is
x[k] = ( [k], [k], u1[k 1], … , u1[k 6],
The gradient of Â¨( x) is given by the chain rule, where is the short notation for ( , , ). Note that= x1, = x2, and and are calculated from x3…14:
u2[k 1], … , u2[k 6])T , (13)
.
1
where the additional state variables x3[k],
… , x8[k] for
z
xÂ¨ =I ,D, p, p,… ,h7,
motor 1, and similarly, x9[k], … , x14[k] for motor 2, are
h ,
2 p, … ,
h7 . (20)
required to represent the FIR filter states of the motor
dynamics from Equation (6). We can estimate the motor
positions [k] and [k] solely from these filter states because the FIR coefficients are h0 = p = 0:
This only shows the second row of the Jacobian xf( x, u) and for brevity we omitted the others as they are trivial. The
7 7 other Jacobian uf( x, u) consists of zero entries apart from
[k] =.
j=2
hju1[k j + 1] =
.
j=2
hjxj+1[k] (14)
the entry 1/t 50 at indices ( 3, 1) and ( 9, 2).
Since the dyn=amics of our system is nonlinear and high
7 7 dimensional, we have to employ an iterative local optimiza
[k] =.
hju2[k j + 1] =
.
hjxj +7[k]. (15)
tion approach. We employ the ILQG method due to its abil ity to include constraints on the commands. More details of
j=2 j=2 the ILQG algorithm can be found in Appendix B.
Based on the rigid body dynamics from Equation (5) we
can compute the acceleration from states (i.e. forward dynamics) as
1 .
I
Â¨[k] = ( [k], [k], [k]) D[k] . (16)
z
Therefore running the dynamics here means account
The usual ILQG formulation is based on an analytically given cost function (deterministic) and a stochastic dynamic function. Here we use a deterministic dynamics (with the idealized analytic model) and we propose a cost function that takes stochastic information into account.
2
ing for motor dynamics by shifting the filter states that is
c( x, u) = wp( x1 )2 +w2vx2 + we  u 
+ wd(( u1 x3)2
xi+1[k + 1] = xi[k] for i = 3, … , 7 and i = 9, … , 13, and
+ ( u2 x9)2
) +wp 2( u) . (21)
then Eulerintegrating the velocities and accelerations:
x[k + 1] = x[k] + t f( x[k], u[k]) (17)
= ( [k] + t[k], [k] + tÂ¨[k], u1 [k],
x3[k], … , x7[k], u2[k],
x9[k], … , x13[k])T . (18)
Alternatively, we can drop the time index k and write the dynamics in.compact form as
All quantities in Equation (21) (also possibly the pre factors) are timedependent, but we have dropped the time indices for notational simplicity. As before wp governs the accuracy requirement. In addition, a stability term wv gov erns the importance of having zero velocity and we penal izes energy consumption at the level of springs. The weight
ing factor wd penalizes changes in motor commands and therefore energy consumption at the level of the servomo
f( x, u)
x , Â¨( x) ,
1
( u x ) ,
1
( x x ) , ,
tor. The last term includes the learned uncertainty in our
= 2
1
t 1 3
1
t 3 4
T
… equilibrium positions, which is here also scaled by wp. This is justified because, for example, for a reaching task, the
arm will finish with the servomotors in a position such that
t ( u2 x8), t ( x8 x9), …
. (19)
the arms equilibrium position is the desired position ,and
we have learned from data how much perturbation we can expect at any such configuration. The same holds true for slow tracking tasks, where the servos will be moved such that the equilibrium positions track the desired trajectory.

Results
In this section we present results from the optimal control model applied to the hardware described earlier in Sec tion 2. We first highlight the adaptation capabilities of this framework experimentally and then show how the learned stochastic information leads to an improved control strat egy over solutions obtained without stochastic information. More specifically the new model achieves higher posi tional accuracy by varying impedance of the arm through motor cocontraction. We study position holding, trajectory tracking and target reaching tasks.
Experiment 1: Adaptation towards a Systematic Change in the System
An advantage of the learned dynamic paradigm is that it allows us to account for systematic changes without prior knowledge of the shape or source of the perturbation. To demonstrate such an adaptation scenario we set up a sys tematic change in the hardware by replacing the left spring, between motor 1 and the joint (i.e. between points A and C in Figure 1), with one that has a lower, unknown spring constant. The aim is to hold a certain equilibrium position using the energy optimal position controller described in Section 3.2. As expected, the prediction about the equilib
rium points (i.e. f( u)) does not match the real, changed
system properties.
Next, we demonstrate how the system can adapt online and increase performance, trial by trial. We specified a tar
get trajectory that is a linear interpolation of 200 steps between the start position 0 = 30 and the target posi
shown by the asymmetric shape. Analyzing the motor com mands (Figure 6 right) shows that the optimal controller, for all trials, chooses the motor commands with virtually no cocontraction. This is a sensible choice as cocontraction would contradict the minimum energy cost function that we have specified.
The Role of Stochastic Informationfor Impedance Control
Because cocontraction and energy consumption are oppos ing properties, our controller will hardly make use of the redundant degree of freedom in the actuation. Even though minimum energy optimal control in an antagonistic system seems to be unable to cocontract it remains our favorite choice of performance index as it also implies compliant movement and as it follows the biological motivation. If we consider the stochastic information that would arise from a task involving random perturbations, we can see that the produced stochasticity holds valuable information about the stability of the system.7 If the uncertainty can be reduced by cocontracting it will be reflected in the data, i.e. in the LWPR confidence bounds. Therefore the answer to the pre vious question is that, given that we wish to achieve high task accuracy, the controller should cocontract whenever it can reduce the expected noise/stochasticity in the system (weighted with the accuracy requirement).
Suppose our system experiences some form of small random perturbations during control. In the hardware we realize such a scenario by adding a perturbation motor at the end of the arm, which mimics, for example, a drilling tool (panel a in Figure 3). The perturbation on the arm is produced by alternating the servomotor positions quickly
every 200 ms from 40 to40. The inertia of the addi tional weight then produces deflections of the arm from the
current equilibrium position. With these perturbations, we
collected new training data and updated the existing LWPR model f. The collected data reveals that the arm stabilizes
tion =
30. We tracked this trajectory by recomputing
in regions with higher cocontraction, where the stiffness
the equilibrium positions, i.e. by minimizing Equation (12)
is higher. This behavior is illustrated in Figure 8, which
at a rate of 50 Hz. At the same time we updated f( u)
shows motion traces around =0
due to the perturbation
during reaching. Due to the nature of local learning algo rithms, f is only updated in the neighborhood of the current
trajectory and therefore shows limited generalization. To account for this, after each trial, we additionally updated the model with 400 training data points, collected from a
motor for different cocontraction levels. This information
is contained in the learned confidence bounds (Figure 9) and, therefore, the optimal controller effectively tries to find the tradeoff between accuracy and energy consumption.
20 Ã— 20 grid of the motors range u1
= u2
= [0, 180].
Experiment 2: Impedance Control for
Figure 6 depicts the outcome of this adaptation experiment.
One can observe that the controller initially (lighter lines) fails to track the desired trajectory (red). However, there is significant improvement between each trial, especially between trials 1 to 5. After about nine trials the internal model has been updated and manages to track the desired trajectory well (up to the hardwares level of precision). The equilibrium position predictions in Figure 7 confirm that the the systematic shift has been successfully learned, as
Varying Accuracy Requirements
Based on the learned LWPR model f from the previous section, we can demonstrate the improved control behav ior of the stochastic optimization with emerging impedance control. We formulate a task to hold the arm at the fixed
positions = 15 and = 0, respectively. While minimiz
ing for the cost function in Equation (12), we continuously
and slowly increased the position penalty within the range
30
Desired
160
9 Motor 1 9
Trial 0
20 Trial 1
Trial 2
Trial 3
3 140
2
Motor commands [deg]
1 120
Motor 2
3
2
Joint angle [deg]
10 Trial 9
0
10
20
30
0 1
100
0
80 0
60 1
2
40 3
9
20
40
0 20 40 60 80 100 120 140 160 180 200
Target index
0
0 20 40 60 80 100 120 140 160 180 200
Target index
Fig. 6. Visualization of the adaptation process. Left: Desired (red) and observed arm positions. Right: Motor commands for the corresponding trials. Darker and thinne lines indicate later stages of learning.
wp =[102, 105]. The left column in Figure 10 summarizes the results we discuss next: At w=p 102 to approximately wp =100 the optimization neglects position accuracy and minimizes mainly for energy, i.e. u=1 u2 = 0. The actual joint positions, because of the perturbations, oscil
late around the mean =0 as indicated by the shaded
the cost function (21) by setting the weighting terms as follows: the timedependent position penalty is a mono tonically increasing linear interpolation of 100 steps, i.e. wp[t] = [0.1, 0.2, … , 10]. The penalty for zero endpoint
velocity was set to wv[t] = 0 for 0 < t < 80 and wv[t] = 1
for t 80. The energy penalties are assumed constant
area. Between wp =100 and wp
=102 the position con
we = wd = 1 during the whole movement.
straint starts to catch up with the energy constraint; a shift in the mean position towards can be observed. At about
Â·
wp =5 101, the variance in the positions increases as the periodic perturbation seems to coincide with the resonance
frequency of the system. For wp > 102 the stochastic infor mation is weighted sufficiently such that the optimal solu tion increases the cocontraction and the accuracy improves further.
In contrast, if we run the same experiment while ignoring the stochastic part of the cost function, i.e. we minimize for
By using ILQG, we then compute an optimal control
sequence uÂ¯ with the corresponding desired trajectory xÂ¯ and a feedback control law L. Figure 12 depicts the reaching
performance of the ILQG trajectory, applied in openloop mode and in closedloop mode (i.e. using feedback law L), where the robot has been perturbed by a manual push. The closedloop scheme successfully corrects the perturbation and reaches the target while the openloop controller oscil lates and fails to reach the target. This experiment high lights the benefits of closedloop optimization which can,
the deterministic cost function J
= wp ( f( u) )2 + u  2
by incorporating the full dynamic description of the sys
only, we can can see (Figure 11) that the system does, as
expected, not cocontract and hardly improves performance accuracy.
4.4. Experiment 3: ILQGReaching Task with a Stochastic Cost Function
For certain tasks, such as quick target reaching or faster tracking of trajectories, the system dynamics based on equi
librium points =f ( u) may not be sufficient, as it contains no information about the velocities and accelerations of the
system. Next, we assume a full forward dynamic descrip tion of our system as identified in Equation (13), where the state consists of joint angles, joint velocities, and twelve motor states.
The task is to start at position 0 = 0 and reach
tem, account for such perturbations. However, the ability to correct perturbations is limited by the hardware control bandwidth (i.e. slow servomotor dynamics and 50 Hz con trol board frequency). If the system also suffers from feed back or motor delays the correction ability is limited and for example accounting for vibrations or noise8 is difficult to achieve using feedback signals only. For such stochastic perturbations, impedance control can improve performance as it changes the mechanical properties of the system in a feedforward manner, i.e. it reduces the effects of the perturbations in the first place.
To realize such a scenario, we defined a tracking task that starts at the zero position then moves away and back again along a sinusoidal curve for 2.5 s. The cost function parameters for this task are defined as follows: The time
dependent position penalty is wp[t] = [50, 100, … , 4,000]
towards the target = 0.3 rad (= 17.18). The reaching for 0 < t < 80 and wp[t] = 4,000 for t 80. The endpoint
movement duration is fixed at 2 s, which corresponds to
T = 100 discretized time steps at the hardwares opera tion rate of 50 Hz. This task can be formalized based on
velocity term is wv[t] = 0 for 0 < t < 80 and wv[t] = 10 for t 80. The energy penalties are held constant, i.e. we = wd = 1.
Learned position after 0 iterations
180
15
180
Learned position after 1 iterations
15
150 150
u [deg]
u [deg]
120 120
1
1
90 90
60 60
30 30
0
0 30 60 90 120 150 180
0
0 30 60 90 120
150 180
u [deg] u
2 2
[deg]Learned position after 3 iterations
180
180
Learn ed position after 9 iteratio ns
150 150
u [deg]
u [deg]
120 120
1
1
90 90
60 60
30 30
0
0 30 60 90 120 150 180
0
0 30 60 90 120 150
180
u [deg] u
2 2
[deg]Fig. 7. Learned position models during the adaptation process. The white numbers represent the equilibrium points.
As before, we observe the benefits of using stochastic information for optimization compared to a determinis tic optimization (not using the LWPR confidence bounds). After computing the optimal control using ILQG we ran the optimal feedback control law (see Appendix B) con secutively 20 times in each condition, i.e. with and without stochastic optimization. Note that the perturbation motor is switched on at all times. Figure 13 summarizes the results: as expected the stochastic information in the cost function induces a coactivation for the reaching task, which shows generally better performance in terms of reduced variabil ity of the trajectories. Evaluating the movement variability where the accuracy weight is maximal, i.e. for t > 80, the standard deviation of the trajectories is significantly lower with stoch = 0.55 for the stochastic optimization com
pared to the deterministic optimization with det = 1.38.
A detailed look at the bottom right plot in Figure 13 reveals a minor shift in the recorded trajectory compared to the planned one from the analytic model. We attribute this error to imprecisions in the hardware, i.e. tiny asymme tries, which are not included in the analytic model. In the case of higher cocontraction, small manufacturing errors and an increased joint friction lead to deviations towards the idealized analytic model predictions. Indeed the learned dynamic model can account for these asymmetries as can be seen in Figure 9 (left), along the equilibrium position = 0, i.e. the line u1 = u2 is slightly skewed.

Conclusion and Outlook
In this paper we have presented a stochastic optimal control model for antagonistically actuated systems. We proposed
Fig. 8. Motion traces of our SEA hardware around = 0. The perturbation motor causes different deflections depending on the co contraction levels: (a) u1 = u2 = 0, (b) u1 = u2 = 45, (c) u1 = u2 = 120.
180
150
u [deg]
120
1
90
60
30
0
Learned Equilibrium Position
50
0
50
0 30 60 90 120 150 180
180
150
u [deg]
120
1
90
60
30
0
Confidence Interval/Noise Level
0 30 60 90 120 150 180
0.08
0.07
0.06
0.05
0.04
0.03
0.02
0.01
u [deg]
2
u [deg]
2
Fig. 9. Left: Learned equilibrium position as a function of the motor positions (in degrees), with contour lines spaced at 5 intervals. Right: Stochastic information given by the heteroscedastic confidence intervals of LWPR.
to learn the dynamics as well as the stochastic information of the controlled system from sensorimotor feedback of the plant. This control architecture can account for a system atic change in the system properties (Experiment 1) and, furthermore, is able, by incorporating the heteroscedastic prediction variances into the optimiation, to compensate for stochastic perturbations that were induced in the plant.
Doing so, our control model demonstrated significantly better accuracy performance than the deterministic opti mization in both energyoptimal equilibrium point control (Experiment 2) and energyoptimal reaching using dynamic optimization (Experiment 3). The improved behavior was achieved by coactivating antagonistic motors, i.e. by using the redundant degree of freedom in the system based on the
25
Desired pos.
20 Actual pos.
15
degrees
10
5
0
5
15
10
degrees
5
0
5
10
Desired pos. Actual pos.
10 2 0
10 10
2 4
10 10
6 15 2 0
10 10 10
2 4 6
10 10 10
Wp
Motor 1
Motor 2
120
100
Wp
Motor 1
100
degrees
80
60
40
20
0
90 Motor 2
80
70
degrees
60
50
40
30
20
10
2 
0 
2 
4 
6 
2 
0 
2 
4 
6 

10 
10 
10 Wp 
10 
10 
10 
10 
10 Wp 
10 
10 
0
Fig. 10. Experiment with increasing position penalty wp for two targets. Left plot column: = 15; right plot column: = 0. The plots show the desired vs. measured position and corresponding motor commands as a function of the accuracy requirement (prefactor wp). The shaded area results from the perturbation motor.
first principles of optimality. The presented results demon strate that this is a viable optimal control strategy for real hardware systems that exhibit hardtomodel system prop erties (e.g. asymmetries, systematic changes) as well as stochastic characteristics (e.g. using a power tool) that may be unknown a priori.
An advantage of the presented control architecture is that motor coactivation (or impedance) does not need to be specified explicitly as a control variable but emerges from the actual learned stochasticity within the system (scaled with the specified accuracy requirements of the task). Therefore, coactivation (i.e. higher impedance), since it is energetically expensive, will only be applied if it actually is beneficial for the accuracy of the task.
Exploiting stochasticity in wider domains The method ology we suggest for optimal exploitation of sensorimo tor stochasticity through learning is a generic principle that goes beyond applications to impedance modulation of antagonistic systems but can be generalized to deal with any kind of control or state dependent uncertainties. For exam ple, if we wish to control a robot arm that suffers from poor repeatability in certain joint angles or in a particular range of velocities, this would be visible in the noise landscape
(given one has learned state dependent stochastic dynamics) and consequently those regions would be avoided by the optimal controller. In this context, the source of the stochas ticity is irrelevant for the learner and therefore, it could arise from internal (i.e. noise in the motor), as well as external (i.e. power tool) sources. However, the stochastic system properties must, to a certain degree, be stationary in time so that the learner can acquire enough information about the noise landscape.
Biological relevance As mentioned in the introduction, biological systems are often used as a benchmark for the control of artificial systems. In this work not only the antag onistic hardware but also the actual control architecture is motivated by biological principles. Optimality approaches have been a very fruitful line of research (Todorov 2004; Scott 2004; Shadmehr and Krakauer 2008) and its combi nation with a learning paradigm (Mitrovic et al. 2008) is biologically justified a priori, since the sensorimotor sys tem can be seen as the product of an optimization process (i.e. evolution, development, learning, adaptation) that con stantly learns to improve its behavioral performance (Li 2006). Indeed, internal models play a key role in efficient human motor control (Davidson and Wolpert 2005) and it
30
25
20
degrees
15
10
5
0
5
10 2 0
10 10
2 4
10 10
Desired pos. Actual pos.
6
10
10
8
6
4
degrees
2
0
2
4
6
8
10 2 0
10 10
2 4
10 10
Desired pos. Actual pos.
6
10
Wp
60
Motor 1
Motor 2
50
degrees
40
30
20
10
0
Wp
20
Motor 1
18 Motor 2
16
14
degrees
12
10
8
6
4
2
0
2 0 2 4 6
2 0 2 4 6
10 10 10 10
Wp
10 10 10 10 10 10
Wp
Fig. 11. The same experiment as in Figure 10 where the stochastic information was not incorporated into the optimization.
20 10
Desired
15 Closed Loop Perturb 5
Open Loop Perturb
10
0
Arm Position [deg]
Arm Position [deg]
5
Desired
Closed Loop
Open Loop
Closed Loop Perturb
Open Loop Perturb
0 5
5
10
15
20
25
30
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
Time [s]
10
15
20
25
30
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
Time [s]
Fig. 12. ILQGreaching task without stochastic information used in openloop and closedloop control. We perturbed the arm by hitting it once after 0.4 s (left plot) and 1.2 s (right plot), respectively. The dashed lines in the right plot represent unperturbed trials (open loop and closed loop).
has been suggested that the motor system forms an internal forward dynamic model to compensate for delays, uncer tainty of sensory feedback, and environmental changes in a predictive fashion (Wolpert et al. 1995; Kawato 1999; Shad mehr and Wise 2005). Notably a learned optimal trade off between energy consumption, accuracy and impedance has been repeatedly observed in human impedance con trol studies (Burdet et al. 2001; Franklin et al. 2008). More specifically, the amount of impedance modulation in
humans seems to be governed by some measure of uncer tainty, which could arise from internal (e.g. motor noise) or external (e.g. tools) sources (Selen et al. 2009).
In the computational model presented here, these uncer tainties are represented by the heteroscedastic confidence bounds of LWPR and integrated into the optimization process via the performance index (i.e. cost function). Such an assumption is biologically plausible, since humans have the ability to learn not only the dynamics but also the
70 Motor 1 70
Motor 2
Motor commands [deg]
Motor Commands [deg]
60 60
50 50
40 40
30 30
20 20
10
0
0 0.5 1 1.5 2
10
0
0 0.5 1 1.5 2
Motor 1
Motor 2
Desired position
Real position, determ. plan
10 10
Arm Position [deg]
Arm Position [deg]
5 5
0 0
5 5
10 10
Desired position
Real position, stoch. plan
15
0 0.5 1 1.5 2
Time [s]
15
0 0.5 1 1.5 2
Time [s]
Fig. 13. Twenty trials of ILQG for a tracking task of 2.5 s. Left column: Deterministic optimization does not exhibit cocontraction. Right column: Twenty trials of ILQG using stochastic information in the cost function. The system cocontracts as the accuracy requirements increase.
stochastic characteristics of tasks, in order t optimally learn the control of a complex task (Chhabra and Jacobs 2006; Selen et al. 2009).
Hardware Limitations and Scalability This work repre sents an initial attempt to modulate the impedance of a real antagonistic system in a principled fashion. The proposed SEA has been primarily designed to perform as a proof of concept of our control method on a real system. Specifi cally we can identify several limitations of our system that need further investigation in the future.
First, the stiffness range of the system is fairly low as spring nonlinearities are achieved by the geometric effect of changing the moment arms. There are other, mechani cally sophisticated, SEA designs with large stiffness ranges (e.g. Grebenstein and van der Smagt 2008; van Ham et al. 2009), which also could serve as attractive implementation platforms for our algorithm. Specifically the MACCEPA design (van Ham et al. 2007) is very appealing as it is tech nically simple and offers a large stiffness range; however, parallels to biologically realistic implementations are less obvious in this design, as the system is not antagonistically
actuated. The fact that we were able to obtain a significant increase in cocontraction from the learned stochastic infor mation, even for hardware with a very low stiffness range is promising, indicating good resolution capabilities of the localized variance measure in LWPR.
Second, the relatively slow control loop (50 Hz) causes controllability issues (i.e. slow feedback) and, furthermore, turned out to be sensitive to numerical integration errors within ILQG. While these numerical issues have not caused problems in an analytic dynamic formulation (Experiment 3), they turned out to be critical when we run ILQG using
the full learned forward dynamics f(x, u). Under these con
ditions, for most of the time ILQG does not converge to a reasonable solution. A potential route of improvement could be a combination of LWPR learning with an analytic model. Instead of ignoring valuable knowledge about the system given in analytic form, one could focus on learning an error model only, i.e. aspects of the dynamics that are not described by the analytic model.
Third, the transfer of optimal controls from simulation to the real hardware has proven to be very challenging. Cur rently we are computing ILQG solutions for a fixed time
horizon and later applying them to the SEA. For slower movements this approach produces satisfying accuracy. In Experiment 3 we enforced slower and smoother move ments by formulating an appropriate timedependent cost function. However, for movements with higher frequency the situation is more difficult: Errors accumulate on the hardware over the course of the trajectory, since the feed back loop for corrections is very slow. This leads to solu tions that differ significantly from the preplanned optimal solution. A potential route to resolve this problem is to use a model predictive control approach in which the opti mal solutions are recomputed during control with current states of the plant as initial states. However, this approach requires computationally efficient recomputations of the optimal control law, which may be hard to obtain, especially for systems with higher dimensionality.
Finally, our experiments were carried out on a low dimensional system with a single joint and two motors. Implementations on systems with higher dimensionality, however, are still very challenging as the construction of antagonistic robots is nontrivial and the availability of large degrees of freedom systems is very limited. Due to the curse of dimensionality, highdimensional systems impose serious computational challenges on both optimal control methods and machine learning techniques. While some of these issues have been addressed in previous work (Todorov et al. 2005; Mitrovic et al. 2008, 2010), we believe that the study of impedance control based on stochastic senso rimotor feedback is a promising route of research for both robotic and biological systems.
Funding
This work was partially supported by the EU FP7 Project STIFF and by a Royal Academy of Engineering and Microsoft Research Fellowship to SV.
Notes

This means we put expectation brackets around the integrals and h( Â·) in (7).

For the infinitehorizon case, the matrix is constant.

Note that we have ignored any motor dynamics.

Stability here refers to the desired equilibrium position.

Heteroscedastic noise has different variances across the state and action space. For example, the variance of the noise can scale with the magnitude of the control signal u, which is also called signal dependent noise.

Alternatively, assume the feedback loop is so slow that it is practically unusable.

For our SEA this optimization can be performed in real time,
i.e. at least 50 times per second, which corresponds to the maximum control frequency of our system (50 Hz).

or any other highfrequency perturbation.
References
Bellman R (1957) Dynamic Programming. Princeton, NJ: Prince ton University Press.
Bertsekas DP (1995) Dynamic programming and optimal control.
Belmont, MA: Athena Scientific.
Bobrow J, Dubowsky S and Gibson J (1985) Timeoptimal control of robotic manipulators along specified paths. The International Journal of Robotics Research 4(3): 317.
Burdet E, Osu R, Franklin DW, Milner TE and Kawato M (2001) The central nervous system stabilizes unstable dynamics by learning optimal impedance. Nature 414: 446449.
Chhabra M and Jacobs RA (2006) Nearoptimal human adap tive control across different noise environments. Journal of Neuroscience 26: 1088310887.
Collins S and Ruina A (2005) A bipedal walking robot with efficient humanlike gait. In Proceedings of the IEEE Inter national Conference on Robotics and automation (ICRA), pp. 19831988.
Collins SH and Kuo AD (2010) Recycling energy to restore impaired ankle function during human walking. PLoS ONE 5 pagese9307.
Cortes J, Martinez S, Ostrowski J and McIsaac KA (2001) Opti mal gaits for dynamic robotic locomotion. The International Journal of Robotics Research 20: 707728.
Davidson PR and Wolpert DM (2005) Widespread access to pre dictive models in the motor system: a short review. Journal of Neural Engineering 2: 313319.
Dyer P and McReynolds S (1970) The Computational Theory of Optimal Control. New York: Academic Press.
Faisal AA, Selen LPJ and Wolpert DM (2008) Noise in the nervous system. Nature Reviews Neuroscience 9: 292303.
Flash T and Hogan N (1985) The coordination of arm movements: an experimentally confirmed mathematical model. Journal of Neuroscience 5: 16881703.
Franklin D, Burdet E, Tee KP, Osu R, Chew CM, Milner TE, et al. (2008) CNS learns stable, accurate, and efficient movements using a simple algorithm. The Journal of Neuroscience 28: 1116511173.
Grebenstein M and van der Smagt P (2008) Antagonism for a highly anthropomorphic handarm system. Advanced Robotics 22: 3955.
Hogan N (1984) Adaptive control of mechanical impedance by coactivation of antagonist muscles. IEEE Transactions on Automatic Control 29: 681690.
Hurst JW, Chestnutt J and Rizzi A (2004) An Actuator with Mechanically Adjustable Series Compliance. Technical Report CMURITR0424, Robotics Institute, Carnegie Mellon University.
Jacobson DH and Mayne DQ (1970) Differential Dynamic Pro gramming. New York: Elsevier.
Kawato M (1999) Internal models for motor control and trajectory planning. Current Opinion in Neurobiology 9: 718727.
Laffranchi M, Tsagarakis NG and Caldwell DG (2010) A vari able physical damping actuator (VPDA) for compliant robotic joints. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).
Li W (2006) Optimal Control for Biological Movement Systems.
PhD dissertation, University of California, an Diego, CA.
Li W and Todorov E (2007) Iterative linearization methods for approximately optimal control and estimation of non linear stochastic system. International Journal of Control 80: 14391453.
Migliore SA, Brown EA and DeWeerth SP (2005) Biologically inspired joint stiffness control. In Proceedings of the 2005
IEEE International Conference on Robotics and Automation (ICRA), pp. 45084513.
Mitrovic D, Klanke S and Vijayakumar S (2008) Adaptive opti mal control for redundantly actuated arms. In Proceedings of the 10th Internatio nal Conference on Simulation of Adaptive Behaviour (SAB), Osaka, Japan.
Mitrovic D, Klanke S and Vijayakumar S (2010) Optimal feed back control for anthropomorphic manipulators. In Proceed ings of the IEEE International Conference on Robotics and Automation (ICRA).
Nakamura Y and Hanafusa H (1987) Optimal redundancy control of robot manipulators. The International Journal of Robotics Research 6: 3242.
Paluska D and Herr H (2006) The effect of series elasticity on actuator power and work output: Implications for robotic and prosthetic joint design. Robotics and Autonomous Systems 54: 667673.
Pratt GA and Williamson MM (1995) Series elastic actuators. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 399406.
Scott SH (2004) Optimal feedback control and the neural basis of volitional motor control. Nature Reviews Neuroscience 5: 532546.
Selen LP, Franklin DW and Wolpert DM (2009) Impedance con trol reduces instability that arises from motor noise. Journal of Neuroscience 40(9): 1260612616.
Shadmehr R and Krakauer JW (2008) A computational neu roanatomy for motor control. Experimental Brain Research 185: 359381.
Shadmehr R and Wise S (2005) The Computational Neurobiology of Reaching and Pointing. Cambridge, MA: The MIT Press. Stengel RF (1994) Optimal control and estimation. New York:
Dover Publications.
Todorov E (2004) Optimality principles in sensorimotor control.
Nature Neuroscience 7: 907915.
Todorov E (2006) Optimal Control Theory. In Doya K (Ed.), Bayesian Brain: Probabilistic Approaches to Neural Coding. Cambridge, MA: MIT Press, pp. 269298.
Todorov E and Li W (2005) A generalized iterative LQG method for locallyoptimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the American Control Conference.
Todorov E, Li W and Pan X (2005) From task parameters to motor synergies: A hierarchical framework for approximatelyoptimal control of redundant manipulators. Journal of Robotic Systems 22: 691710.
Tonietti G, Schiavi R and Bicchi A (2005) Design and con trol of a variable stiffness actuator for safe and fast physi cal human/robot interaction. IEEE International Conference Robotics and Automation (ICRA) pp. 526531.
Uno Y, Kawato M and Suzuki R (1989) Formation and control of optimal trajectories in human multijoint arm movements: minimum torquechange model. Biological Cybernetics 61: 89101.
van Ham R, Sugar T, Vanderborght B, Hollander K and Lefeber
D (2009) Compliant actuator designs. IEEE Robotics and
Design and implementation in a biped robot. Robotics and Autonomous Systems 55: 761768.
Vanderborght B, Van Ham R, Lefeber D, Sugar TG and Hollan der KW (2009) Comparison of mechanical design and energy consumption of adaptable, passivecompliant actuators. The International Journal of Robotics Research 28: 90103.
Vijayakumar S, DSouza A and Schaal S (2005) Incremental online learning in high dimensions. Neural Computation 17: 26022634.
Wolf S and Hirzinger G (2008) A new variable stiffness design: Matching requirements of the next robot generation. In Pro ceedings of the IEEE International Conference on Robotics and Automation (ICRA).
Wolpert DM, Ghahramani Z and Jordan MI (1995) An internal model for sensorimotor integration. Science 269: 18801882.
Zinn M, Khatib O, Roth B and Salisbury JK (2004) Playing it safe. IEEE Robotics and Automation Magazine 11: 1221.
Appendix A: Learning with LWPR
In order to learn the plant dynamics various supervised learning algorithms could be applied. Here we focus on local learning methods, which represent a function by using small simplistic patches, e.g. firstorder polynomials. The size of the locality is determined by gating activation ker nels, and the positions and number of the local kernels are adapted during learning to represent the nonlinear func tion. Because the input data activates only local patches, local learning algorithms are robust against global nega tive interference. This ensures the flexibility of the learned model towards systematic changes in the dynamic proper ties of the arm (e.g. load, material wear). Furthermore, the domain of realtime robot control demands certain proper ties of a learning algorithm, namely fast learning rates and high prediction speeds at runtime if the model is trained incrementally. LWPR has been shown to exhibit these prop erties, and to be very efficient for incremental learning of nonlinear models (Vijayakumar et al. 2005).
In LWPR, the regression function is constructed by blending local linear models, each of which is endowed with a locality kernel that defines the area of its validity (also termed its receptive field). During training, the parameters of the local models (locality and fit) are updated using incre mental Partial Least Squares, and models can be pruned or added on an asneed basis, for example, when training data is generated in previously unexplored regions. Usually the receptive fields of LWPR are modeled by Gaussian kernels, so their activation or response to a query vector z (here the inputs are the two motor commands u) is given by
.
1 T , (22)
Automation Magazine 16(3): 8194.
van Ham R, Vanderborght B, Van Damme M, Verrelst B and
wk( z) = exp
2 ( z ck)
Dk( z ck )
Lefeber D (2007) MACCEPA, the mechanically adjustable compliance and controllable equilibrium position actuator:
where ck is the center of the kth linear model and Dk is its distance metric. Treating each output dimension separately
for notational convenience, the regression function can be written as
K K
1 .
f( z) =W
.
wk( z) k ( z), W =
wk( z) , (23)
k=1 k=1
0 T
k( z) = bk + bk ( z ck) , (24)
k
where b0 and bk denote the offset and slope of the kth model, respectively.
LWPR learning has the desirable property that it can be carried out online, and moreover, the learned model can be adapted to changes in the dynamics in realtime. Further more, the statistical parameters of LWPR regression models provide access to the confidence intervals, here termed con fidence bounds, of new prediction inputs (Vijayakumar et al. 2005). In LWPR the predictive variances are assumed to evolve as an additive combination of the variances within a local model and the variances independent of the local
Fig. 14. Typical regression function (blue continuous line) using LWPR. The dots indicate a representative training data set. The receptive fields are shown as ellipses drawn at the bottom of the plot. The shaded region represents the confidence bounds around the prediction function. The confidence bounds grow between z =[5..6] (no training data) and generally towards larger z values (noise grows with larger values).
2 i i i i
model. The predictive variance estimates predf,ok r the kth local model can be computed by analogy with ordinary
xÂ¯ [k + 1] = xÂ¯ [k] + t f( xÂ¯ [k], uÂ¯ [k]). Next, the discretized dynamics (Equation (5)) are linearly approximated as
linear regression. Similarly one can formulate the global variances 2 across models. By analogy with Equation (23), LWPR then combines both variances additively to form the
x[k + 1] =
.
.
f
x .
I + t
xÂ¯ [k]
x[k] + t u[k.
f .
u .uÂ¯ [k]
confidence bounds given by (26)
.
2 1 K
o pred = W 2 wk( z) 2 +
.K 2
wk ( z) pred,k . (25)
Similarly one can derive a quadratic approximation of the cost function around xÂ¯i [k] and uÂ¯ i [k]:
1
k=1 k=1 cost[k] = q[k] + x[k]Tq[k] + x[k]TQ[k]x[k] (27)
The local nature of LWPR leads to the intuitive require ment that only receptive fields that actively contribute to the prediction (e.g. large linear regions) are involved in the actual confidence bounds calculation. Large confidence
bound values typically evolve if the training data contains where
2 +
u[k]Tr[k] + 1 u[k]TR[k]u[k] +
2
u[k]TP[k]x[k]
much noise and other sources of variability, such as chang v[k]
ing output distributions. Further regions with sparse or no
trainingdata, i.e. unexplored regions, showlarge confidence
q[k] = tv[k] q[k] = t x
..
xÂ¯ [k]
(28)
bounds compared with densely trained regions. Figure 14 depicts the learning concepts of LWPR graphically for a learned model with one input and one output dimension.
2v[k]
.
Q[k] = t . x x
v[k] .
xÂ¯[k],uÂ¯ [k]
P[k] = t
2v[k]
.
u x .xÂ¯[k],uÂ¯ [k]
2 v[k] .
The noisy training data was drawn from an example func tion that becomes more linear and more noisy for larger z
r[k] = t
u .
uÂ¯ [k]
R[k] = t
u u .
.
uÂ¯ [k]
values. Furthermore, in the range z
= [5..6] no data was
Both approximations are formulated as deviations xi[k]
sampled for training to show the effects of sparse data on
=
i i i i
LWPR learning.
xi[k] xÂ¯[k] and u [k] =u [k]
uÂ¯[k] of the current opti
Appendix B: The ILQG Algorithm
The ILQG algorithm starts with a timediscretized initial guess of an optimal control sequence and then iteratively improves it with respect to the cost function. From the
initial control sequence uÂ¯ i at the ith iteration, the corre sponding state sequence xÂ¯ i is retrieved using the determin istic forward dynamics f with a standard Euler integration
mal trajectory and therefore form a local LQG problem.
This linear quadratic problem can be solved efficiently via a modified Ricattilike set of equations that yields an affine
control law [k]( =x) l[k] +L[k]x[k]. This control law has a special form: since it is defined in terms of deviations
of a nominal trajectory and since it needs to be implemented iteratively, it consists of an openloop component l[k] and a feedbackcomponent L[k]x[k]. The actual optimization in ILQG supports constraints for the control variable u, such as lower and upper bounds. After the optimal control signal
correction Â¯ui has been obtained, it can be used to improve the current optimal control sequence for the next iteration using uÂ¯i+1[k] = uÂ¯ i[k] + Â¯ui[k]. Finally, uÂ¯ i+1[k] is applied to
where x[k] represents the real plant position and x[kÂ¯] the desired position at time k.
the system dynamics (Equation (5)) and the new total cost
cost function
L,
feedback
along the trajectory is computed. The algorithm stops once the cost v cannot be significantly decreased anymore. After
(incl. target)
x x
controller
u
convergence, ILQG returns an optimal control sequence uÂ¯
and a corresponding state sequence xÂ¯ (i.e. trajectory). Along
dynamics model
ILQG
u u + u plant
+ (SEA)
with the openloop parameters xÂ¯and u,Â¯ ILQG produces a feedback matrix L which may serve as optimal feedback
gains for correcting local deviations from the desired tra jectory of the plant (Figure 15). The control law for each time step k is defined as
u[k]plant = uÂ¯ [k] + u[k] (29)
u[k] = L[k] Â· (x[k] xÂ¯ [k]) , (30)
Fig. 15. The optimal feedback control scheme using ILQG.