
Deep Reinforcement Learning for Aerobraking Maneuvers

Course: Intro to Robot Learning
Carnegie Mellon University
Pittsburgh, PA

March 2024 - May 2024

Aerobraking is a technique that uses a planet's atmosphere to adjust a spacecraft's trajectory from an interplanetary hyperbolic path to a desired elliptical orbit. Traditional aerobraking operations depend on constant ground-station monitoring, which introduces delays because of the long communication distances involved in interplanetary missions. To overcome these challenges, this project applies a reinforcement learning approach to enable efficient onboard control. Specifically, we train a Proximal Policy Optimization (PPO) agent to execute aerobraking maneuvers more quickly and with lower fuel consumption than traditional ground-in-the-loop operations.


Visualization of an aerobraking campaign, starting from a hyperbolic interplanetary orbit and ending at the desired lower-energy elliptical orbit.

The first step of aerobraking is to transition the spacecraft from its hyperbolic interplanetary trajectory into a high-energy elliptical orbit whose periapsis (the closest point to the planet) lies inside the planet's atmosphere, as shown in the figure above. During repeated passes through the atmosphere, known as drag passages, the spacecraft's orbital energy gradually decreases until the desired apoapsis radius is achieved. At this point, additional propulsive maneuvers are performed to raise the periapsis out of the atmosphere and stabilize the orbit. Control is applied through ΔV adjustments at apoapsis, which raise or lower the periapsis, decreasing or increasing the drag on the next passage accordingly.
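
Some standard two-body intuition (not from the report) for why a burn at apoapsis controls the periapsis: with the apoapsis radius r_a held fixed, the vis-viva equation relates the apoapsis speed v_a directly to the periapsis radius r_p (μ is the planet's gravitational parameter).

```latex
% Vis-viva at apoapsis, with semi-major axis a = (r_a + r_p)/2:
v_a^2 = \mu\left(\frac{2}{r_a} - \frac{1}{a}\right)
      = \mu\left(\frac{2}{r_a} - \frac{2}{r_a + r_p}\right)
      = \frac{2\mu\, r_p}{r_a\,(r_a + r_p)}
```

Since v_a grows monotonically with r_p for a fixed r_a, a prograde ΔV at apoapsis raises the periapsis and reduces drag on the next passage, while a retrograde ΔV lowers it and increases drag.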


This project builds on work by Falcone et al. to develop an aerobraking agent that simulates the Mars Odyssey orbit-insertion mission. The spacecraft's state is described by the orbital elements of its current orbit: apoapsis radius, periapsis altitude, inclination, argument of periapsis (AOP), and right ascension of the ascending node (RAAN).

The system's action space is a discrete set of ΔV values that can be applied at each apoapsis: A = [−1.0, −0.5, −0.3, −0.2, −0.1, −0.05, 0, 0.05, 0.1, 0.2, 0.3, 0.5, 1.0]. Negative actions lower the periapsis, while positive actions raise it.
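
As a minimal sketch of how this interface could be encoded (the function name and array layout are assumptions, not taken from the report):

```python
import numpy as np

# Discrete ΔV actions applied at each apoapsis (values from the report).
ACTIONS = np.array([-1.0, -0.5, -0.3, -0.2, -0.1, -0.05, 0.0,
                    0.05, 0.1, 0.2, 0.3, 0.5, 1.0])

# State vector: orbital elements of the current orbit
# (apoapsis radius, periapsis altitude, inclination, AOP, RAAN).
def make_state(ra_km, hp_km, inc_deg, aop_deg, raan_deg):
    return np.array([ra_km, hp_km, inc_deg, aop_deg, raan_deg], dtype=np.float32)
```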


Every aerobraking episode begins with an initial apoapsis radius of 10038 km. The episode terminates when one of three conditions occurs: (a) the spacecraft reaches or passes the target apoapsis radius of 4906 km; (b) the periapsis altitude falls below 85 km, below which the spacecraft is bound to crash into the planet's surface; or (c) the periapsis altitude exceeds 135 km, beyond which atmospheric drag is negligible.
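
A minimal sketch of these termination checks (the function and constant names are mine; the thresholds are the values stated above):

```python
TARGET_APOAPSIS_KM = 4906.0   # success: apoapsis radius at or below the target
MIN_PERIAPSIS_ALT_KM = 85.0   # below this altitude the spacecraft would crash
MAX_PERIAPSIS_ALT_KM = 135.0  # above this altitude atmospheric drag is negligible

def episode_done(apoapsis_radius_km, periapsis_alt_km):
    """Return (done, reason) according to the three termination conditions."""
    if apoapsis_radius_km <= TARGET_APOAPSIS_KM:
        return True, "target apoapsis reached"
    if periapsis_alt_km < MIN_PERIAPSIS_ALT_KM:
        return True, "periapsis too low (crash)"
    if periapsis_alt_km > MAX_PERIAPSIS_ALT_KM:
        return True, "periapsis too high (no drag)"
    return False, ""
```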


Falcone et al. trained their agent on a high-fidelity, physics-based simulator. Due to the complex physics involved, the authors required a training time of one week on a 50-core cluster. To reduce training time, a neural-network system model with 2 hidden layers of 32 neurons each was trained on the results of the physics-based simulator. The trained model was accurate to within 1 km of the physics-based results, as shown in the figure below.


The trained neural-network model was found to be accurate to within 1 km of the high-fidelity physics simulation. The red lines represent the physics-based model and the green lines represent the trained model.
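
A minimal sketch of such a surrogate dynamics model (the deep-learning framework, the activation function, and all names are assumptions; only the two 32-neuron hidden layers are stated in the report):

```python
import torch
import torch.nn as nn

class SurrogateDynamics(nn.Module):
    """Learned one-step model: (orbital elements, ΔV action) -> next orbital elements."""
    def __init__(self, state_dim=5, action_dim=1, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),  # hidden layer 1
            nn.Linear(hidden, hidden), nn.ReLU(),                  # hidden layer 2
            nn.Linear(hidden, state_dim),                          # predicted next state
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```

Training such a model on trajectories collected from the physics simulator turns each drag passage into a cheap forward pass, which is what makes training on a single machine practical.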

The most important part of a reinforcement-learning-based solution is the reward formulation. A combination of a continuous (per-step) reward and a terminal reward was used to provide the agent with continuous feedback: the agent is penalized on the normalized distance to the target apoapsis and on the magnitude of the action it takes.

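The exact coefficients are given in the full report; the sketch below only illustrates the structure of such a reward (the weights and the terminal bonus/penalty values are placeholders, not the report's numbers):

```python
def step_reward(apoapsis_radius_km, action_dv, done, reason,
                r_target=4906.0, r_initial=10038.0,
                w_dist=1.0, w_action=0.1):
    """Continuous shaping penalty plus a terminal reward (illustrative values only)."""
    # Penalize the normalized distance to the target apoapsis and the ΔV magnitude used.
    dist = abs(apoapsis_radius_km - r_target) / (r_initial - r_target)
    reward = -w_dist * dist - w_action * abs(action_dv)
    if done:
        # Terminal reward: bonus for reaching the target, penalty otherwise (placeholder values).
        reward += 10.0 if reason == "target apoapsis reached" else -10.0
    return reward
```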

The policy was represented by a neural network with 4 hidden layers of 512 neurons each. The policy took the current state as input and output softmax probabilities over the actions in the action space; an action was sampled according to these probabilities and applied to the system. Policy updates were clipped at a ratio of 0.2 to keep the updated policy close to the policy that generated the samples.
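
A minimal PyTorch sketch of such a policy together with PPO's clipped surrogate objective (the Tanh activation and all function names are assumptions; the 4x512 architecture, softmax output, and 0.2 clipping ratio are as stated above):

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """4 hidden layers of 512 neurons, softmax over the 13 discrete ΔV actions."""
    def __init__(self, state_dim=5, n_actions=13, hidden=512, n_layers=4):
        super().__init__()
        layers, dim = [], state_dim
        for _ in range(n_layers):
            layers += [nn.Linear(dim, hidden), nn.Tanh()]
            dim = hidden
        layers.append(nn.Linear(dim, n_actions))  # logits; softmax handled by Categorical
        self.net = nn.Sequential(*layers)

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

def ppo_clip_loss(policy, states, actions, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (sign flipped so it can be minimized)."""
    ratio = torch.exp(policy(states).log_prob(actions) - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```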


After training the PPO agent over 1.1 million steps (where each step is an action applied at apoapsis, the subsequent drag passage, and the return to the new apoapsis) with a batch size of 512 steps, the trained agent achieved an average episodic reward of +5 with a terminal error of 8.4 km.
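
The report does not specify the PPO implementation used; with a library such as stable-baselines3, a configuration matching the stated hyperparameters might look like the sketch below (AerobrakingEnv is a hypothetical gym-style wrapper around the surrogate dynamics model):

```python
from stable_baselines3 import PPO

env = AerobrakingEnv()  # hypothetical gym-style environment built on the surrogate model
model = PPO(
    "MlpPolicy", env,
    n_steps=512,       # rollout batch of 512 apoapsis-to-apoapsis steps
    clip_range=0.2,    # PPO clipping ratio
    policy_kwargs=dict(net_arch=[512, 512, 512, 512]),  # 4 hidden layers of 512 neurons
)
model.learn(total_timesteps=1_100_000)  # roughly 1.1 million training steps
```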


Average episode length in propulsive maneuvers (left) and average episodic return (right) versus the number of training steps. The agent stabilizes around a return of +5 after roughly 200,000 training steps.

To verify the agent against the actual physics-based simulator, the actions computed by the agent were fed into the physics simulator and the resulting trajectory was observed. Thanks to the high accuracy of our neural-network state-space model, the commanded inputs succeeded in bringing the spacecraft to the target orbit, as shown below.
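
A minimal sketch of this validation replay (the simulator interface and greedy action selection are assumptions; the key point is that the trained policy is rolled out while the high-fidelity simulator, not the surrogate model, propagates the orbit):

```python
import torch

def replay_in_hifi_sim(policy, hifi_sim):
    """Roll out the trained policy with the high-fidelity simulator in the loop."""
    state, done = hifi_sim.reset(), False
    while not done:
        with torch.no_grad():
            dist = policy(torch.as_tensor(state, dtype=torch.float32))
        action_idx = dist.probs.argmax().item()      # greedy action for evaluation
        state, done = hifi_sim.step(ACTIONS[action_idx])
    return state
```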

Apoapsis Radius Variation with Orbits

Apoapsis radius over successive orbits when the agent's commanded ΔV actions are replayed in the physics-based simulator. The apoapsis decreases with each drag passage until the target radius is reached.

Our PPO agent was compared against the actual Mars Odyssey mission as well as a Double Deep Q-Network (DDQN) baseline; the results are shown in the table below. Overall, our PPO agent achieves a successful aerobraking maneuver with lower fuel consumption and a shorter maneuver time than traditional, human-in-the-loop decision making.

Comparison of the PPO agent against the DDQN baseline and the actual Mars Odyssey mission.
