
Space-Gym

A set of RL environments with locomotion tasks in space. The goal is to navigate a (planar) spaceship to reach prescribed goals or to enter a prescribed orbit. We define a few tasks of varying difficulty; some of them are hard for state-of-the-art off-policy algorithms (SAC, TD3).

We learned a lot from the environment design process. In particular, we found it challenging to shape the reward function so that the RL algorithm converges to a satisfactory control policy.

The goal of this repository is to share the environments with the community as a benchmark for testing suitable reinforcement learning methods and algorithms. We believe that all of the tasks can be solved much better than demonstrated here.

Authors: Jacek Cyranka & Kajetan Janiak (University of Warsaw)

A paper with extended versions of the environments is currently under preparation.

If you have feedback or any questions/requests concerning the Space-Gym envs, do not hesitate to open an issue here or contact the authors directly by email.

Installation

Run pip install -e . in the repository root, then see the example agent in keyboard_agent.py.
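
Below is a minimal random-agent sketch, assuming that importing gym_space registers the predefined environment IDs (such as GoalContinuous2P-v0) with Gym and that the classic Gym step API is used:

```python
# Minimal usage sketch; the gym_space registration behavior is assumed.
import gym
import gym_space  # noqa: F401  # assumed to register the Space-Gym environments

env = gym.make("GoalContinuous2P-v0")
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()          # random policy, for demonstration only
    obs, reward, done, info = env.step(action)
    env.render()                                # instantiates the Renderer on first call
env.close()
```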

Environments

GoalEnv

Navigate the spaceship to reach subsequent goal positions while avoiding crashing into any planet or leaving the world (window) boundaries.

Parameters:

  • n_planets - number of planets to avoid
  • survival_reward_scale - fraction of reward for staying alive (not crashing)
  • goal_vel_reward_scale - fraction of reward for velocity toward current goal
  • safety_reward_scale - fraction of reward for not flying fast toward close obstacles
  • goal_sparse_reward - reward for achieving a goal
  • ship_steering - whether the ship is steered by angular velocity (the action sets the angular velocity) or by angular acceleration (the action sets the angular acceleration); in the latter case the ship has a fixed moment of inertia (set with the ship_moi parameter)

For the exact formula of the reward please refer to GoalEnv._reward() in gym_space/envs/goal.py.
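
The sketch below only illustrates how the parameters above could combine into a shaped step reward; it is not the actual formula, and the input quantities are hypothetical placeholders:

```python
# Illustrative sketch only -- NOT the actual reward; see GoalEnv._reward()
# in gym_space/envs/goal.py.  velocity_towards_goal, obstacle_proximity_term
# and goal_reached are hypothetical placeholder quantities.
def sketch_goal_reward(survival_reward_scale, goal_vel_reward_scale,
                       safety_reward_scale, goal_sparse_reward,
                       velocity_towards_goal, obstacle_proximity_term,
                       goal_reached):
    reward = survival_reward_scale                            # staying alive
    reward += goal_vel_reward_scale * velocity_towards_goal   # move towards the current goal
    reward -= safety_reward_scale * obstacle_proximity_term   # flying fast towards close obstacles
    if goal_reached:
        reward += goal_sparse_reward                          # sparse bonus for reaching a goal
    return reward
```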

There are several difficulty levels. For each level we provide the rewards achieved by the best RL method that we tested and a human baseline score, obtained using the keyboard agent (see keyboard_agent.py).

  1. Two planets within the region boundaries (predefined env with default parameters: GoalContinuous2P-v0). This level is easily solved by the off-policy RL algorithms (SAC & TD3).
  2. Three planets within the boundaries (predefined env: GoalContinuous3P-v0). A much harder challenge than two planets; all of the tested RL methods struggle to exploit gravity and avoid crashing.
  3. Four planets within the boundaries (predefined env: GoalContinuous4P-v0). None of the tested methods solves this environment; the learned policies cannot avoid crashing repeatedly.

Kepler Orbit Env

Control the spaceship to enter a specified orbit from an arbitrary initial condition while using the least energy.

Parameters:

  • ship_steering - whether the ship is steered by angular velocity (the action sets the angular velocity) or by angular acceleration (the action sets the angular acceleration); in the latter case the ship has a fixed moment of inertia (set with the ship_moi parameter).
  • rad_penalty_C - coefficient of the penalty term for the distance to the reference orbit radius (the reward is inversely proportional to this distance).
  • numerator_C - constant in the numerator of the step-wise reward, also added to the denominator (hence the maximal step-wise reward is 1).
  • act_penalty_C - coefficient of the penalty term for the energy used to perform the action during the current step.
  • step_size - the numerical integrator step size.
  • randomize - whether the target orbit is randomized at every reset (if so, the orbit parameters are appended to the observation vector).
  • ref_orbit_a, ref_orbit_eccentricity, ref_orbit_angle - the reference parameters of the target orbit when the orbit is fixed.

The reward is also inversely proportional to the absolute difference between the current velocity and the reference orbit velocity, which is easy to compute for Kepler orbits.

For the exact formula of the reward please refer to KeplerEnv._reward() in gym_space/envs/kepler.py.
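
As a rough illustration of how the coefficients above might enter a bounded step-wise reward (this is not the exact formula; the error terms below are hypothetical placeholders):

```python
# Illustrative sketch only -- NOT the exact reward; see KeplerEnv._reward()
# in gym_space/envs/kepler.py.  radius_error, velocity_error and
# action_energy are hypothetical placeholder quantities.
def sketch_kepler_reward(numerator_C, rad_penalty_C, act_penalty_C,
                         radius_error, velocity_error, action_energy):
    # Bounded tracking term: equals 1 when both errors are zero and decays
    # with the distance to the reference orbit radius and velocity.
    tracking = numerator_C / (numerator_C
                              + rad_penalty_C * radius_error
                              + velocity_error)
    # Energy penalty for the thrust applied during the current step.
    return tracking - act_penalty_C * action_energy
```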

Preliminary Training Results

We performed a number of training runs using Stable-Baselines3, in particular rl-baselines3-zoo, with the default hyperparameters of TD3 and SAC; PPO performed significantly worse. A preliminary hyperparameter optimization showed no significant improvement over the defaults.
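
For reference, a standalone sketch of such a training run with Stable-Baselines3 defaults (we used rl-baselines3-zoo in practice; the environment ID, timestep budget and save path below are only examples):

```python
# Sketch of training SAC with default hyperparameters on a Space-Gym env.
import gym
import gym_space  # noqa: F401  # assumed to register the Space-Gym environments
from stable_baselines3 import SAC

env = gym.make("GoalContinuous2P-v0")
model = SAC("MlpPolicy", env, verbose=1)    # default SAC hyperparameters
model.learn(total_timesteps=1_000_000)      # illustrative training budget
model.save("sac_goal_2p")
```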

GoalEnv

2 Planets

Human baseline score (mean ± std. dev. over 5 episodes), measured using keyboard_agent.py: 4715 ± 799

3 Planets

Human baseline score (mean ± std. dev. over 5 episodes), measured using keyboard_agent.py: 4659 ± 747

4 Planets

Kepler Orbit Env

Circle Orbit

Elliptical Orbit

Conclusions and Future Work

There is still significant room for improving the performance of RL agents in the presented environments. One particularly promising direction is to try safe RL methods. We expect that better-shaped reward functions and extended observation vectors may result in significant performance improvements as well.

We could not explain the dramatic performance drop when increasing the number of planets from 2 to 3: the measured human baseline scores are similar for the two-planet and three-planet environments, while the RL agents' scores are significantly lower in the three-planet case.

Implementation Remarks

Environments

There are six non-abstract environment classes: three with discrete and three with continuous action spaces, defined in envs/do_not_crash.py, envs/kepler.py, and envs/goal.py.

All of them inherit from the abstract base class SpaceshipEnv defined in envs/spaceship_env.py. This class takes care of physics, actions, collisions, etc. Child classes have to instantiate the base class with selected parameter values and implement the _reset and _reward methods.
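
A schematic sketch of adding a new task on top of the base class follows; the exact constructor signature of SpaceshipEnv is not reproduced here, so the super().__init__() call below is an assumption:

```python
from gym_space.envs.spaceship_env import SpaceshipEnv


class MyTaskEnv(SpaceshipEnv):
    def __init__(self):
        # The base class handles physics, actions and collisions; the child
        # class selects parameter values and defines the task.
        super().__init__()  # hypothetical: real constructor arguments omitted

    def _reset(self):
        # Sample the initial ship state (and task-specific objects)
        # for a new episode.
        pass

    def _reward(self):
        # Return the scalar reward for the current simulation step.
        return 0.0
```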

Physical dynamics

All code responsible for physics simulation is in dynamic_model.py. Please refer to the docstrings in that file.

Rendering

The Renderer class in rendering.py is responsible for visualizing the environment. The class won't be instantiated and nothing will be drawn unless the render() method of an environment is called.

The parameters num_prev_pos_vis and prev_pos_color_decay control the length and appearance of the trail that follows the ship.

GoalEnv initial position sampling

To make initial position sampling efficient for a large number of planets, we implemented an algorithm based on hexagonal tiling of the plane. The related code is in hexagonal_tiling.py; to make sense of it, please refer to notebooks/hexagonal_tiling.ipynb.
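
The following is only a rough sketch of the idea, not the actual implementation: cover the window with a hexagonal packing of circles and draw object positions from distinct cells so they cannot overlap (the world size and cell radius are made-up example values):

```python
import math
import random

def hex_cell_centers(world_size: float, cell_radius: float):
    """Centers of a hexagonal packing of circles of radius cell_radius
    covering the square [0, world_size]^2."""
    centers = []
    dy = math.sqrt(3) * cell_radius                        # vertical distance between rows
    row, y = 0, cell_radius
    while y < world_size - cell_radius:
        x = cell_radius + (cell_radius if row % 2 else 0)  # offset every other row
        while x < world_size - cell_radius:
            centers.append((x, y))
            x += 2 * cell_radius
        y += dy
        row += 1
    return centers

# Example: sample non-overlapping cells, e.g. for 4 planets, the ship and the goal.
centers = hex_cell_centers(world_size=10.0, cell_radius=1.0)
positions = random.sample(centers, k=4 + 2)
```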

Stable-Baselines 3 starting agents

TBA

License

Copyright 2021 Jacek Cyranka & Kajetan Janiak (University of Warsaw)

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


space-gym's Issues

[GoalEnv] more interesting initial conditions

Planets are often sampled next to each other; I would add some clear radius, and maybe sample the ship position in the lower triangle and the goal in the upper triangle, so that interactions with the planets happen often.

We can place the planets and the goal however we like, but the ship's position should be as random as possible.

identifiers for the observation vector

At the moment the raw observation vector is hard to read:
[-0.03321611 -0.02896443 0.60457392 0.79654904 -0.1300648 0.21109212 0.06192183 0.63304779 -0.03531498 0.02768544 -0.00573701 -0.08104321 -0.92910233 0.23563375 -0.47430738 -0.22272539 -0.14541353 0.35445339 0.55028567]

It would be good to add a function returning an identifier for each coordinate, or a getter for each variable type.

[GoalEnv] visualization fixes

The ship's trail and the engine exhaust currently get a gray color, while they should be black but slightly transparent. This looks bad when they cover other objects.

[GoalEnv] reward implementation

The reward is dense. At each step it is a constant for surviving plus a bounded non-negative value for being near the goal point (minus constant * action_norm as a penalty for being uneconomical).

[GoalEnv] It does not learn

GoalEnv does not learn. The reason is probably an incomplete observation vector: information about acceleration or lidars measuring distances is missing, so the ship finds out it is too close to a planet only after crashing and episode termination. The reward is probably also too sparse.
