
On the Degradation of PPO Advantage Estimates Under Non-Stationary Aerodynamic Disturbances: A Theoretical Bound and Empirical Validation

DOI : 10.5281/zenodo.20577530

Sahil Patil

Department of Aerospace Engineering Indian Institute of Technology Bombay (IIT Bombay) Mumbai, India

Nasiruddin Kabir

Department of Aerospace Engineering Indian Institute of Technology Bombay (IIT Bombay) Mumbai, India

Abstract: Proximal Policy Optimization (PPO) is now the default choice for training autonomous UAV flight controllers, but its convergence guarantees assume a stationary Markov Decision Process. Real aerospace environments break this assumption: atmospheric turbulence, wind shear, and gust loading inject stochastic perturbations whose intensity varies from rollout to rollout. We study how the Generalized Advantage Estimation (GAE) bias scales with disturbance intensity $\sigma$. For a quadratic value function driven by a zero-mean disturbance, we prove that the expected bias admits an upper bound that is quadratic in $\sigma$: $\Delta_A(\sigma) \le C(\theta, T, \lambda, \gamma) \cdot \sigma^2$, with $C$ depending on the policy parameters $\theta$, rollout length $T$, the GAE decay parameter $\lambda$, and discount factor $\gamma$. We give a closed form for $C$ in terms of the spectral norm of the LQR value-function Hessian and the closed-loop disturbance-propagation gain. We test the bound in two settings. An analytical LQR baseline, whose linear policy stays stable at every disturbance level, follows the quadratic law almost exactly ($R^2 = 0.9999$). A learned Stable-Baselines3 PPO controller, trained independently at each of eight disturbance levels ($\sigma \in \{0.00, 0.05, 0.10, 0.20, 0.30, 0.50, 0.80, 1.00\}$ m/s) and evaluated over 200 episodes per level, departs sharply from the pure quadratic prediction: the policy collapses under strong turbulence, so a quadratic fit reaches only $R^2 = 0.845$ and a free power-law fit gives an exponent of 3.44. We then derive a critical threshold $\sigma^*$ beyond which the bias exceeds a chosen fraction of the nominal advantage scale, giving practitioners a concrete design criterion. To our knowledge this is the first formal, quantitative link between turbulence intensity and PPO training degradation in an aerospace RL setting.

Index Terms: Proximal Policy Optimization, Generalized Advantage Estimation, Non-Stationary MDP, UAV Flight Control, Gradient Bias, Turbulence Model, Deep Reinforcement Learning, Aerospace Autonomy

  1. Introduction

    Deep reinforcement learning (DRL) has become a serious contender for autonomous aerospace control, spanning quadrotor attitude stabilization, fixed-wing trajectory tracking, and multi-UAV conflict resolution [1]-[3]. Among policy-gradient methods, PPO [4] is the one practitioners reach for most often, thanks to its stability, simple implementation, and better sample efficiency than TRPO or on-policy A3C. Much of that stability comes from Generalized Advantage Estimation (GAE) [5], which builds a low-variance estimate of the advantage function from sampled rollouts.

    PPO's theoretical guarantees lean on an assumption that is easy to overlook: the underlying MDP is stationary, with transition dynamics $T(s' \mid s, a)$ fixed across every time step and episode. On controlled benchmarks such as MuJoCo locomotion this holds well enough. Aerospace is different. The operating environment is non-stationary by nature: turbulence, wind shear, and thermal gradients all perturb the vehicle dynamics, and their intensity drifts continuously. The Dryden and von Kármán models [6] that dominate aerospace simulation treat these perturbations as colored Gaussian noise whose intensity tracks a severity parameter $\sigma$.

    Even though these disturbances are everywhere in practice, we are not aware of any prior work that quantifies how $\sigma$ erodes the quality of PPO's advantage estimates, or that pins down a threshold $\sigma^*$ past which convergence is at risk. The consequence is that practitioners pick $\sigma$ in simulation by feel, with no principled rule for when the disturbance level starts to undermine PPO's assumptions.

    This paper makes three contributions:

    • We derive an upper bound on the expected GAE advantage estimation error $\Delta_A(\sigma)$ that is an explicit quadratic function of $\sigma$, with a closed-form constant $C(\theta, T, \lambda, \gamma)$. The quadratic scaling, rather than linear, falls directly out of pairing a quadratic value function with a zero-mean disturbance.

    • We test the bound in two regimes: an analytical LQR controller, which obeys the quadratic law almost exactly, and an independently trained Stable-Baselines3 PPO controller, which departs from it once the learned policy begins to collapse.

    • We derive the critical turbulence threshold $\sigma^*$ as a function of a prescribed bias tolerance $f$, giving aerospace RL practitioners a concrete criterion for designing simulation environments.

  2. Background and Related Work

    A. Proximal Policy Optimization

    PPO [4] maximizes a clipped surrogate objective to keep policy updates from destabilizing training. At iteration $k$ it collects a rollout dataset $D_k = \{(s_t, a_t, r_t)\}$ under the current policy $\pi_{\theta_k}$, estimates advantages $\hat{A}_t$ with GAE, and maximizes

    $$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t\right)\right], \qquad (1)$$

    where $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_k}(a_t \mid s_t)$ is the importance ratio and $\epsilon$ is the clipping range.

    The advantage estimate $\hat{A}_t$ sets both the direction and the size of the update, so a corrupted estimate points the gradient the wrong way. Clipping only limits the magnitude of any single step, so it does nothing to fix a systematically biased direction.

    B. Generalized Advantage Estimation

    GAE [5] forms the advantage estimate as an exponentially weighted sum of TD residuals:

    $$\hat{A}_t^{\mathrm{GAE}} = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l\, \delta_{t+l}, \qquad (2)$$

    where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the one-step TD error, $\lambda \in [0, 1]$ is the decay parameter, and $\gamma$ is the discount factor.

    When $V$ is fit under nominal dynamics but evaluated along a disturbed trajectory, every TD residual picks up error from two sources at once: the reward discrepancy and the value-function mismatch at the perturbed states.

    C. Non-Stationarity in Aerospace RL

    Non-stationary dynamics have been studied through domain randomization [7] and robust MDPs [8], but those analyses target the policy performance gap under distribution shift, not the statistics of the advantage estimator itself. The closest prior work is Padakandla et al. [9], who studied Q-learning under slowly varying MDPs but did not treat actor-critic methods or GAE. On the aerospace side, Bohn et al. [10] and Panerati et al. [11] showed empirically that turbulence hurts PPO training, but neither characterized the rate of degradation. That rate is what we set out to quantify.

  3. Problem Formulation

    A. Nominal MDP

    We model a UAV in a 2D position-velocity state space, with state $s = (x, v_x, y, v_y)^\top \in \mathbb{R}^4$ and continuous action $a = (a_x, a_y)^\top \in \mathbb{R}^2$ giving commanded acceleration increments. The nominal discrete-time dynamics with timestep $\Delta t$ are

    $$s_{t+1} = A s_t + B a_t, \qquad (3)$$

    where $A \in \mathbb{R}^{4 \times 4}$ and $B \in \mathbb{R}^{4 \times 2}$ are the standard double-integrator matrices with $\Delta t = 0.1$ s. The reward is the usual LQR cost:

    $$r(s_t, a_t) = -\left(s_t^\top Q s_t + a_t^\top R a_t\right), \quad Q = I_4, \quad R = 0.1\, I_2. \qquad (4)$$

    B. Disturbed Dynamics

    We inject a crosswind into the y-velocity component at each step, modelling a lateral gust:

    $$s_{t+1} = A s_t + B a_t + E w_t, \quad w_t \sim \mathcal{N}(0, \sigma^2), \qquad (5)$$

    where $E = (0, 0, 0, 1)^\top$ steers the wind into the lateral velocity state and $\sigma \ge 0$ is the turbulence intensity, with $\sigma = 0$ recovering the nominal MDP. The gust is i.i.d. and zero-mean at each step (the white-noise limit of the Dryden spectrum), which is enough to expose the scaling law we analyze. A fully colored Dryden process is discussed in Section VI.

    C. Policy and Value Function

    We use a linear-quadratic regulator (LQR) policy $\pi(s) = -Ks$, with gain $K$ taken from the discrete-time infinite-horizon LQR solution. The optimal value function $V^*(s) = -s^\top P s$, where $P$ is the unique positive-definite solution of the discrete-time algebraic Riccati equation (DARE), is available in closed form, so we can compute GAE errors exactly and keep value-approximation error from confounding the analysis. The Riccati solution $P$ has diagonal entries [12.999, 4.848, 12.999, 4.848], obtained by iterating the DARE recursion to a fixed-point tolerance of $10^{-12}$ (reached in 141 iterations). The key structural fact is that $V^*$ is quadratic in the state; together with the zero mean of $w_t$, this is what produces the quadratic scaling derived below.
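The LQR ingredients described above can be sketched numerically. This is a minimal illustration, not the authors' code: the text only says "standard double-integrator matrices", so the per-axis blocks `A1` and `B1` below are an assumed zero-order-hold discretization, and the helper names `lqr_gain` and `solve_dare` are hypothetical. Because of that assumption, the printed entries of $P$ need not match the values quoted in the text digit for digit.

```python
import numpy as np

DT = 0.1
# ASSUMPTION: per-axis double integrator (position, velocity) with acceleration input.
A1 = np.array([[1.0, DT], [0.0, 1.0]])
B1 = np.array([[0.5 * DT**2], [DT]])
A = np.kron(np.eye(2), A1)           # block-diagonal, state order (x, vx, y, vy)
B = np.kron(np.eye(2), B1)           # one acceleration input per axis
Q, R = np.eye(4), 0.1 * np.eye(2)    # cost weights from Eq. (4)
E = np.array([0.0, 0.0, 0.0, 1.0])   # gust direction from Eq. (5)


def lqr_gain(P):
    """Feedback gain K = (R + B'PB)^{-1} B'PA for a given Riccati iterate P."""
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)


def solve_dare(tol=1e-12, max_iter=10_000):
    """Iterate the Riccati recursion P <- Q + A'P(A - BK) until the update is below tol."""
    P = Q.copy()
    for i in range(1, max_iter + 1):
        P_next = Q + A.T @ P @ (A - B @ lqr_gain(P))
        if np.max(np.abs(P_next - P)) < tol:
            return P_next, i
        P = P_next
    raise RuntimeError("Riccati iteration did not converge")


P, n_iter = solve_dare()
K = lqr_gain(P)
A_cl = A - B @ K                                  # closed-loop matrix
rho = float(max(abs(np.linalg.eigvals(A_cl))))    # spectral radius

print("diag(P) =", np.round(np.diag(P), 3))
print("iterations =", n_iter, " spectral radius =", round(rho, 3))
```

A spectral radius below one indicates that the closed loop is asymptotically stable, which is the property the analytical baseline relies on at every disturbance level.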

    D. PPO Training Setup

    Where classical control would hand us an analytic policy, we instead let PPO learn the flight controller directly from simulated experience, using the Stable-Baselines3 implementation. We train a separate policy for each turbulence level $\sigma$. Both the policy and value networks are MLPs with two hidden layers of 64 units and Tanh activations, trained for 1 000 000 timesteps each. Hyperparameters are $\gamma = 0.99$, GAE $\lambda = 0.95$, learning rate $3 \times 10^{-4}$, rollout length 2048, minibatch size 64, and 10 optimization epochs per rollout. This lets us measure GAE bias not only under a fixed analytic policy but inside a fully learned, black-box deep RL controller.
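The training configuration above maps onto the Stable-Baselines3 `PPO` constructor roughly as follows. This is a sketch under assumptions rather than the authors' script: the function name `train_one` is hypothetical, the environment construction is not described in the text and is left to the caller, passing `seed` to `PPO` is an assumption, and the dictionary form of `net_arch` assumes a recent Stable-Baselines3 release. The library imports are deferred into the function so the configuration itself can be inspected without the dependencies installed.

```python
SIGMA_LEVELS = (0.00, 0.05, 0.10, 0.20, 0.30, 0.50, 0.80, 1.00)  # m/s, one policy per level
TOTAL_TIMESTEPS = 1_000_000

# Hyperparameters as listed in the text, using Stable-Baselines3 argument names.
PPO_KWARGS = dict(
    gamma=0.99,          # discount factor
    gae_lambda=0.95,     # GAE decay parameter
    learning_rate=3e-4,
    n_steps=2048,        # rollout length
    batch_size=64,       # minibatch size
    n_epochs=10,         # optimization epochs per rollout
)


def train_one(env, seed=0):
    """Train one PPO policy on `env`, a Gymnasium environment built for a single sigma.

    ASSUMPTION: how `env` is constructed is not specified in the text.
    """
    import torch.nn as nn
    from stable_baselines3 import PPO

    # Two hidden layers of 64 Tanh units for both the policy and the value network.
    policy_kwargs = dict(
        net_arch=dict(pi=[64, 64], vf=[64, 64]),
        activation_fn=nn.Tanh,
    )
    model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, seed=seed, **PPO_KWARGS)
    model.learn(total_timesteps=TOTAL_TIMESTEPS)
    return model
```

Training an independent model per entry of `SIGMA_LEVELS` keeps each critic matched to a single disturbance level, so differences between levels are not confounded by a shared network.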

    Fig. 1. PPO training curves. Mean rollout return, value loss, and approximate KL divergence over 1 000 000 timesteps for each turbulence level $\sigma$. Higher $\sigma$ produces visibly more variance and instability during learning.

  4. Main Theoretical Result

    A. Advantage Error Decomposition

    Let $\hat{A}_t^{\mathrm{GAE}}(\sigma)$ be the GAE advantage at step $t$ computed from a trajectory drawn under disturbance intensity $\sigma$, using the true value function $V^*$. Define the advantage estimation error at step $t$ as

    $$\Delta_t(\sigma) = \left|\hat{A}_t^{\mathrm{GAE}}(\sigma) - \hat{A}_t^{\mathrm{GAE}}(0)\right|. \qquad (6)$$

    Expanding the GAE sum, the error splits into a weighted sum of TD-residual discrepancies:

    $$\Delta_t(\sigma) = \left|\sum_{l=0}^{T-t-1} (\gamma\lambda)^l \left[\delta_{t+l}(\sigma) - \delta_{t+l}(0)\right]\right|. \qquad (7)$$

    Each discrepancy carries three pieces: the reward difference at the perturbed state, the value at the perturbed next state, and the value at the perturbed current state. Since both the reward and the value are quadratic forms in the state, the discrepancy is controlled by the second moment of the state perturbation.

    B. Second-Order Sensitivity of the Value Function

    With $V^*(s) = -s^\top P s$, the gradient is $\nabla V^*(s) = -2Ps$ and the Hessian is constant, $\nabla^2 V^*(s) = -2P$. For a perturbation $\xi$ the expansion is exact:

    $$V^*(s + \xi) - V^*(s) = -2\, s^\top P \xi - \xi^\top P \xi. \qquad (8)$$

    A single wind impulse $w_t$ propagates through the closed-loop dynamics, so the perturbation $k$ steps later is $\xi_{t+k} = A_{cl}^k E\, w_t$ with $A_{cl} = A - BK$. Because $w_t$ is zero-mean, the first-order term drops out in expectation,

    $$\mathbb{E}\left[-2\, s^\top P\, \xi_{t+k}\right] = 0, \qquad (9)$$

    and what survives is the second-order term, proportional to the second moment:

    $$\mathbb{E}\left[\xi_{t+k}^\top P\, \xi_{t+k}\right] = \|A_{cl}^k E\|_P^2\, \sigma^2 \le \|P\|_2\, \|A_{cl}^k E\|_2^2\, \sigma^2. \qquad (10)$$

    This is the heart of the argument: a zero-mean disturbance passing through a quadratic value function produces an expected bias that scales with the variance $\sigma^2$, not with $\sigma$. A first-order Lipschitz treatment, which would predict linear scaling, vanishes in expectation and misses the dominant effect entirely.

    C. Main Theorem

    Theorem 1 (Quadratic GAE Bias Under Turbulence). Let the UAV dynamics satisfy Assumptions 1-3 (double-integrator structure, LQR policy, zero-mean lateral gust). Let $V^*$ be the true LQR value function and $A_{cl} = A - BK$ the closed-loop matrix. Then the expected mean GAE advantage error satisfies

    $$\mathbb{E}[\bar{\Delta}(\sigma)] := \mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1} \Delta_t(\sigma)\right] \le C \cdot \sigma^2, \qquad (11)$$

    where

    $$C = \frac{2\, \|P\|_2 \cdot \Gamma \cdot T}{1 - (\gamma\lambda)^2}, \qquad \Gamma = \sum_{k=0}^{T-1} \|A_{cl}^k E\|_2^2, \qquad (12)$$

    with $\|P\|_2$ the spectral norm of the Riccati matrix (the magnitude of the value-function Hessian), $\Gamma$ the cumulative closed-loop disturbance-propagation gain, $T$ the rollout length, $\gamma$ the discount factor, and $\lambda$ the GAE decay parameter.

    Proof Sketch. The argument runs in four steps: (1) expand each TD-residual discrepancy and drop the first-order term, which vanishes in expectation since $w_t$ is zero-mean; (2) bound the surviving second-order term by $\|P\|_2 \|A_{cl}^k E\|_2^2\, \sigma^2$ via the Hessian of $V^*$; (3) sum the squared GAE weights, the geometric series $\sum_{l \ge 0} (\gamma\lambda)^{2l} = 1/(1 - (\gamma\lambda)^2)$; and (4) apply linearity of expectation over the $T$ steps. The result is a genuine, non-vacuous upper bound: the empirically measured constant sits well below the theoretical $C$, the slack coming from the worst-case use of $\|P\|_2$ in place of the trajectory-averaged $P$-norm and from cross-term cancellation across steps.

    D. Critical Disturbance Threshold

    A practical question follows: given a tolerance on advantage error, expressed as a fraction $f$ of the typical advantage magnitude $\mathbb{E}[|\hat{A}^{\mathrm{GAE}}|]$, what is the largest turbulence intensity we can allow? Treating the relationship as quadratic, $\mathbb{E}[\bar{\Delta}(\sigma)] = C_{\mathrm{emp}}\, \sigma^2$, and setting the bias equal to $f \cdot \mathbb{E}[|\hat{A}^{\mathrm{GAE}}|]$ gives

    $$\sigma^*(f) = \sqrt{\frac{f \cdot \mathbb{E}[|\hat{A}^{\mathrm{GAE}}|]}{C_{\mathrm{emp}}}}, \qquad (13)$$

    where $C_{\mathrm{emp}}$ is the fitted quadratic coefficient. Training with $\sigma > \sigma^*(f)$ pushes the advantage bias past tolerance $f$, at which point one should either reduce $\sigma$ during early training or apply bias correction.

  5. Experimental Validation

    A. Setup

    The UAV environment is implemented in Python with Gymnasium. Fixed hyperparameters across all runs: $\Delta t = 0.1$ s, $T = 200$ steps per episode, $\gamma = 0.99$, $\lambda = 0.95$, and initial state $s_0 \sim \mathcal{N}(0, 0.25\, I_4)$. We sweep eight disturbance levels $\sigma \in \{0.00, 0.05, 0.10, 0.20, 0.30, 0.50, 0.80, 1.00\}$ m/s, training an independent PPO policy to convergence at each. We then run 30 episodes to measure control performance (average return and success rate) and 200 episodes to estimate the GAE bias from the learned value critic. Results are reported as means ± standard deviation, with random seed 0 fixed for reproducibility.

    B. Results

    Table I gives the full numbers, and they show a steep, non-linear rise in $\mathbb{E}[\bar{\Delta}(\sigma)]$ with $\sigma$. As a reference point, we also ran the analytical LQR controller of Section III-C under identical disturbance conditions over 3000 Monte Carlo trials per level. Because the linear policy never destabilizes, its bias tracks the theory almost perfectly, yielding a quadratic fit with $R^2 = 0.9999$. The learned PPO policy tells a different story. It holds up at low turbulence ($\sigma = 0.0$ gives a perfect success rate of 1.0 and an average return of $-9.77$) but degrades fast as $\sigma$ grows. By $\sigma = 0.50$ the success rate has fallen to 0.07 (return $-866.47$), and at $\sigma = 1.00$ the policy fails outright, with a return of $-1.3 \times 10^6$. This trajectory divergence is what drives the GAE bias from 0.0 in calm air up to 6488.28 at the highest turbulence level.
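The cancellation argument of Section IV-B, which is what the LQR baseline exercises, can be illustrated with a small simulation. The sketch below is not the authors' Monte Carlo protocol. It assumes a zero-order-hold double integrator for $A$ and $B$, and it measures the signed mean difference between disturbed and nominal GAE advantages rather than the absolute error of the per-step definition. It also pairs each gust sequence with its negation (antithetic sampling) so that the first-order term cancels exactly. All function names are hypothetical.

```python
import numpy as np

DT, GAMMA, LAM, T = 0.1, 0.99, 0.95, 200
# ASSUMPTION: zero-order-hold double integrator, state order (x, vx, y, vy).
A = np.kron(np.eye(2), np.array([[1.0, DT], [0.0, 1.0]]))
B = np.kron(np.eye(2), np.array([[0.5 * DT**2], [DT]]))
Q, R = np.eye(4), 0.1 * np.eye(2)
E = np.array([0.0, 0.0, 0.0, 1.0])


def lqr(n_iter=500):
    """Fixed-point iteration of the Riccati recursion; returns (P, K)."""
    P = Q.copy()
    for _ in range(n_iter):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return P, np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)


def gae(rewards, values):
    """Truncated GAE: A_t = sum_l (gamma*lam)^l delta_{t+l}; values has one extra entry."""
    deltas = rewards + GAMMA * values[1:] - values[:-1]
    adv, running = np.zeros_like(deltas), 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + GAMMA * LAM * running
        adv[t] = running
    return adv


def advantages(s0, gusts, P, K):
    """Roll out the LQR policy under the given gust sequence and return GAE advantages."""
    s, rewards, values = s0.copy(), [], [-(s0 @ P @ s0)]
    for w in gusts:
        a = -K @ s
        rewards.append(-(s @ Q @ s + a @ R @ a))
        s = A @ s + B @ a + E * w
        values.append(-(s @ P @ s))   # V*(s) = -s'Ps evaluated at the visited state
    return gae(np.array(rewards), np.array(values))


def signed_bias(sigma, P, K, n_pairs=20, seed=0):
    """Mean signed advantage shift, averaged over antithetic gust pairs (+z, -z)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_pairs):
        s0 = rng.normal(0.0, 0.5, size=4)   # s0 ~ N(0, 0.25 I)
        z = rng.standard_normal(T)          # unit-variance gust shape
        nominal = advantages(s0, np.zeros(T), P, K)
        for sign in (1.0, -1.0):
            total += np.mean(advantages(s0, sign * sigma * z, P, K) - nominal)
    return total / (2 * n_pairs)


P, K = lqr()
ratios = {s: signed_bias(s, P, K) / s**2 for s in (0.05, 0.10, 0.20)}
for s, r in ratios.items():
    print(f"sigma = {s:.2f}  bias / sigma^2 = {r:.6f}")
```

Because the state is affine in the gusts and every term of the TD residual is a quadratic form, the ratio of bias to $\sigma^2$ printed for each level should agree up to floating-point error once the linear term has cancelled.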

    TABLE I

    Measured GAE advantage estimation error $\mathbb{E}[\bar{\Delta}(\sigma)]$ versus turbulence intensity $\sigma$, for the learned PPO policy. All values are means over n = 200 evaluation episodes.

    $\sigma$ (m/s)   Success Rate   Avg Return      $\mathbb{E}[\bar{\Delta}]$
    0.00             1.00           -9.77           0.00
    0.05             1.00           -12.55          0.20
    0.10             1.00           -18.40          0.81
    0.20             0.90           -59.26          3.59
    0.30             0.37           -124.19         10.23
    0.50             0.07           -866.47         93.51
    0.80             0.00           -554637.84      1813.46
    1.00             0.00           -1303219.38     6488.28
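The regression figures quoted in Section V-C can be re-derived from the tabulated bias values with a few lines of arithmetic. The sketch below is an illustrative recomputation, not the authors' analysis script. It assumes an ordinary least-squares fit through the origin for the quadratic model, the usual mean-centred definition of $R^2$, and a log-log least-squares fit (dropping the $\sigma = 0$ row, where the logarithm is undefined) for the power-law exponent.

```python
import math

# Bias values copied from Table I.
sigma = [0.00, 0.05, 0.10, 0.20, 0.30, 0.50, 0.80, 1.00]
bias = [0.00, 0.20, 0.81, 3.59, 10.23, 93.51, 1813.46, 6488.28]

# Quadratic model y = C * sigma^2 through the origin:
# minimizing sum (y - C x^2)^2 gives C = sum(x^2 y) / sum(x^4).
c_emp = sum(s**2 * b for s, b in zip(sigma, bias)) / sum(s**4 for s in sigma)

# Coefficient of determination against the mean-centred total sum of squares.
mean_b = sum(bias) / len(bias)
ss_res = sum((b - c_emp * s**2) ** 2 for s, b in zip(sigma, bias))
ss_tot = sum((b - mean_b) ** 2 for b in bias)
r2_quadratic = 1.0 - ss_res / ss_tot

# Free power law y = a * sigma^p: slope of ln(y) against ln(sigma).
pts = [(math.log(s), math.log(b)) for s, b in zip(sigma, bias) if s > 0 and b > 0]
mx = sum(x for x, _ in pts) / len(pts)
my = sum(y for _, y in pts) / len(pts)
p_exp = sum((x - mx) * (y - my) for x, y in pts) / sum((x - mx) ** 2 for x, _ in pts)

print(f"C_emp = {c_emp:.2f}, R^2 = {r2_quadratic:.3f}, exponent p = {p_exp:.2f}")
```

Under these assumptions the coefficient, the quadratic $R^2$, and the exponent come out close to the values reported in the text, which suggests the fits were computed directly on the eight tabulated means.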

    Fig. 2. PPO episode return versus turbulence intensity $\sigma$. Beyond $\sigma = 0.20$ m/s the average return falls off steeply, ending in complete policy collapse and trajectory divergence at $\sigma \ge 0.80$ m/s.

    C. Regression Analysis

    Fitting the quadratic model $\mathbb{E}[\bar{\Delta}(\sigma)] = C_{\mathrm{emp}}\, \sigma^2$ through the origin across the eight PPO measurements gives

    $$\mathbb{E}[\bar{\Delta}(\sigma)] = 5178.02\, \sigma^2, \qquad (14)$$

    with $R^2 = 0.845$. The contrast with the analytical baseline is the whole point: the LQR controller obeys the quadratic law cleanly ($p = 2$, $R^2 = 0.9999$), whereas the PPO results pull away from it once the policy collapses. A free power-law fit ($\propto \sigma^p$) to the PPO data returns an exponent of $p = 3.44$ ($R^2 = 0.44$). In other words, the learned network does not merely accumulate the structural quadratic penalty: it loses its stabilization ability entirely at high turbulence, so the state diverges and the bias climbs faster than the quadratic limit the theory sets for a stable policy.

    Fig. 3. Cause: GAE advantage bias versus turbulence intensity $\sigma$. The bias grows slowly at low $\sigma$ and then explodes once the policy can no longer stabilize the vehicle.

    Fig. 4. Effect: episode return versus GAE advantage bias. Each point is one disturbance level; the strong inverse relationship shows that the growth in advantage error coincides directly with the loss of control performance.

    D. Critical Threshold Analysis

    The typical nominal advantage magnitude, estimated as $\mathbb{E}[|\hat{A}^{\mathrm{GAE}}|]$ over the nominal rollout steps ($\sigma = 0.0$), is 0.0272. With the fitted quadratic coefficient $C_{\mathrm{emp}} = 5178.02$, the critical turbulence intensity at which the bias reaches a fraction $f$ of this scale is

    $$\sigma^*(f) = \sqrt{\frac{f \times 0.0272}{5178.02}}. \qquad (15)$$

    For $f = 0.10$ and $f = 0.25$ this gives $\sigma^*(0.10) \approx 0.0007$ m/s and $\sigma^*(0.25) \approx 0.0011$ m/s. Two cautions are worth stating. First, these thresholds are remarkably tight, which says the learned PPO model is far more sensitive to turbulence than a stable linear controller: even a trace disturbance deserves attention when designing aerospace RL simulations. Second, because the true PPO degradation is steeper than quadratic ($p = 3.44$), the quadratic model underestimates the bias at larger $\sigma$; the values above should therefore be read as an optimistic (upper) estimate of the safe operating range, not a guarantee.

    Fig. 5. Log-log plot of GAE bias versus $\sigma$. The free power-law fit has slope 3.44, against the theoretical slope of 2.0 (dashed). The gap reflects the combined effect of the structural quadratic penalty and the additional divergence of the neural policy.

    Fig. 6. Residuals of the two regression models. The power-law fit tracks the extreme bias growth at $\sigma \ge 0.50$ m/s noticeably better than the quadratic fit, which under-predicts in that range.
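The threshold arithmetic above is easy to reproduce. The following sketch evaluates the square-root expression for the two tolerances quoted in the text; the function name `sigma_star` is hypothetical, and the two constants are the values reported in this section.

```python
import math

A_NOMINAL = 0.0272   # E[|A^GAE|] at sigma = 0, as reported in the text
C_EMP = 5178.02      # fitted quadratic coefficient, as reported in the text


def sigma_star(f, a_nominal=A_NOMINAL, c_emp=C_EMP):
    """Largest sigma (m/s) for which c_emp * sigma^2 stays within f * a_nominal."""
    if f < 0:
        raise ValueError("tolerance fraction f must be non-negative")
    return math.sqrt(f * a_nominal / c_emp)


for f in (0.10, 0.25):
    print(f"f = {f:.2f}  ->  sigma* = {sigma_star(f):.4f} m/s")
```

Because the threshold scales with the square root of $f$, relaxing the tolerance by a factor of 2.5 widens the admissible disturbance range by only about 1.6.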

  6. Discussion

    The central comparison, an analytical policy that follows the quadratic law almost exactly ($R^2 = 0.9999$) against a learned policy that degrades faster ($p = 3.44$), carries several practical lessons.

    First, the quadratic law is not an empirical accident; it is built into the estimator. Any locally quadratic value function under a zero-mean disturbance, which covers both the exact LQR $V^*$ and the second-order expansion of a smooth neural critic, will produce an expected advantage bias governed by the disturbance variance. The first-order sensitivity one might expect to give linear scaling cancels in expectation. The practical reading is that small cuts in turbulence intensity buy disproportionately large cuts in estimation bias.

    Second, the result motivates a curriculum over turbulence during PPO training: start with $\sigma < \sigma^*$ so the value function can settle on a near-stationary problem, then raise $\sigma$ to build robustness. This echoes the domain-randomization curricula used in sim-to-real transfer [7], but here the schedule has a quantitative, square-root shape that comes out of Theorem 1 rather than intuition.
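One possible reading of the square-root shape is sketched below. This is a hypothetical schedule, not one the text specifies: it assumes the tolerated quadratic bias budget, which is proportional to $\sigma^2$, is raised linearly over training, so that $\sigma$ itself follows a square-root curve between a starting level and a target level. The function name and its arguments are illustrative.

```python
import math


def curriculum_sigma(progress, sigma_start, sigma_target):
    """Disturbance level at training progress in [0, 1].

    ASSUMPTION: the variance sigma^2 (and hence the quadratic bias budget)
    is interpolated linearly, so sigma follows a square-root profile.
    """
    p = min(max(progress, 0.0), 1.0)   # clamp progress to [0, 1]
    variance = sigma_start**2 + (sigma_target**2 - sigma_start**2) * p
    return math.sqrt(variance)


# Example: ramp from a near-stationary start to a 0.5 m/s target.
for p in (0.0, 0.25, 0.5, 1.0):
    print(f"progress = {p:.2f}  sigma = {curriculum_sigma(p, 0.0007, 0.5):.4f} m/s")
```

Interpolating in variance rather than in $\sigma$ front-loads the increase in disturbance level while keeping the growth of the predicted bias steady across training.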

    Third, the looseness of the theoretical bound is itself informative. The slack has three sources: (i) the global spectral norm $\|P\|_2$ overstates the local sensitivity along the nominal trajectory, where state norms are small; (ii) the bound assumes every GAE term hits its maximum error at once, whereas in practice errors partly cancel across steps; and (iii) the cumulative gain $\Gamma$ sums squared closed-loop propagations without the cancellation seen in the data. One could tighten the bound by substituting $C_{\mathrm{emp}}$ directly, but that would forfeit the explicit dependence on the system parameters. We keep the analytical form because it exposes how the bias depends on $\gamma$, $\lambda$, $T$, and $P$.

    Fourth, the threshold $\sigma^* \approx 0.0007$ m/s is specific to this UAV model and PPO setup. A different vehicle or network will have a different $C_{\mathrm{emp}}$, but the protocol in Section V-A transfers directly to obtain vehicle-specific thresholds. Theorem 1 supplies the structural justification for that procedure, even as the policy's own divergence adds a layer of non-linearity on top of the structural error.

    A clear limitation is the use of an i.i.d. Gaussian gust rather than a fully colored Dryden process. Temporal correlation would change $\Gamma$ but not the quadratic scaling, which depends only on the zero mean of the disturbance and the quadratic critic. Extending the bound to a colored spectrum and to richer neural critics are both natural next steps.

  7. Conclusion

We have analyzed how GAE advantage estimates degrade under non-stationary aerodynamic disturbances in UAV reinforcement learning. Theorem 1 establishes a quadratic upper bound $\Delta_A(\sigma) \le C \cdot \sigma^2$ for a stable policy, and the analytical LQR baseline confirms it almost exactly ($R^2 = 0.9999$). A learned PPO controller degrades faster (exponent 3.44) because it loses stabilization entirely at high turbulence: the quadratic fit reaches only $R^2 = 0.845$, and the derived threshold $\sigma^* \approx 0.0007$ m/s (at 10% tolerance) is strikingly narrow. Together these give aerospace RL practitioners both a warning and a principled criterion for simulation design.

Future work will extend the analysis to neural-network critics, to multi-axis and colored (Dryden / von Kármán) turbulence, and to finite-time PPO convergence rates under non-stationarity. All code for the numerical results is available at https://github.com/iitb-kabir/PPOAdvantageEstimates.

Acknowledgment

The authors thank the anonymous reviewers for their constructive comments. This work was supported by the Department of Aerospace Engineering, IIT Bombay, India.

References

  1. W. Koch, R. Mancuso, R. West, and A. Bestavros, "Reinforcement learning for UAV attitude control," ACM Trans. Cyber-Phys. Syst., vol. 3, no. 2, pp. 1-22, 2019.

  2. A. Loquercio, E. Kaufmann, R. Ranftl, A. Dosovitskiy, V. Koltun, and D. Scaramuzza, "Learning high-speed flight in the wild," Science Robotics, vol. 6, no. 59, p. eabg5810, 2021.

  3. C. Wang, J. Wang, Y. Shen, and X. Zhang, "Autonomous navigation of UAVs in large-scale complex environments: A deep reinforcement learning approach," IEEE Trans. Veh. Technol., vol. 68, no. 3, pp. 2124-2136, 2019.

  4. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.

  5. J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," in Int. Conf. Learning Representations (ICLR), 2016.

  6. MIL-STD-1797A: Flying Qualities of Piloted Aircraft, United States Department of Defense, 1990.

  7. J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in IEEE/RSJ IROS, 2017, pp. 23-30.

  8. A. Nilim and L. El Ghaoui, "Robust control of Markov decision processes with uncertain transition matrices," Oper. Res., vol. 53, no. 5, pp. 780-798, 2005.

  9. S. Padakandla, P. K. J., and S. Bhatnagar, "Reinforcement learning algorithm for non-stationary environments," Appl. Intell., vol. 50, no. 11, pp. 3590-3606, 2020.

  10. E. Bohn, S. Gros, and M. Diehl, "Reinforcement learning of fixed-wing flight with turbulence disturbances," in AIAA SciTech Forum, Paper AIAA 2019-0538, 2019.

  11. J. Panerati et al., "Learning to Fly - a Gym Environment with PyBullet Physics for Reinforcement Learning of Multi-agent Quadcopter Control," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021.

  12. B. D. O. Anderson and J. B. Moore, Optimal Control: Linear Quadratic Methods. Englewood Cliffs, NJ: Prentice-Hall, 1990.

  13. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA: MIT Press, 2018.

Appendix

All experiments were implemented in Python 3.10 using NumPy and Stable-Baselines3. The Riccati equation was solved by iterating the standard DARE recursion to tolerance $10^{-12}$, reached in 141 iterations. Random seed 0 was fixed via numpy.random.seed(0) before all experiments. PPO training completes in a few hours on a modern multi-core CPU.

The Riccati matrix $P$ has diagonal entries $p_{11} = p_{33} = 12.999$ and $p_{22} = p_{44} = 4.848$, with off-diagonal elements $p_{12} = p_{21} = p_{34} = p_{43} = 3.030$ and all cross-axis entries zero (decoupled x-y dynamics). The spectral norm is $\|P\|_2 = 14.378$. The closed-loop matrix $A_{cl}$ has spectral radius 0.904, confirming asymptotic stability, with cumulative disturbance-propagation gain $\Gamma = \sum_k \|A_{cl}^k E\|_2^2 = 2.42$.
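For orientation, the theoretical constant can be evaluated from the quantities listed here. The sketch below rests on two assumptions that should be checked against the published equations: it reads the bound constant as $C = 2\|P\|_2 \, \Gamma \, T / (1 - (\gamma\lambda)^2)$, and it takes $T = 200$ (the episode length) as the rollout length. The function name `bound_constant` is hypothetical.

```python
P_NORM = 14.378          # spectral norm of P, as quoted in the Appendix
GAIN = 2.42              # cumulative disturbance-propagation gain Gamma
HORIZON = 200            # ASSUMPTION: episode length used as rollout length T
GAMMA, LAM = 0.99, 0.95  # discount factor and GAE decay parameter


def bound_constant(p_norm, gain, horizon, gamma, lam):
    """ASSUMED form of the bound constant: 2 ||P||_2 Gamma T / (1 - (gamma*lam)^2)."""
    geometric = 1.0 / (1.0 - (gamma * lam) ** 2)   # sum of squared GAE weights
    return 2.0 * p_norm * gain * horizon * geometric


C_THEORY = bound_constant(P_NORM, GAIN, HORIZON, GAMMA, LAM)
print(f"theoretical C = {C_THEORY:.0f}")
```

The geometric factor grows quickly as $\gamma\lambda$ approaches one, which is one reason a worst-case constant of this form can sit far above an empirically fitted coefficient.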