Giter Site home page Giter Site logo

seuwky1021 / angry_birds_rl Goto Github PK

View Code? Open in Web Editor NEW

This project forked from parshakova/angry_birds_rl

0.0 0.0 0.0 5.92 MB

Playing Angry Birds: Deep Deterministic Policy Gradient with Attention-based Long Short-Term Memory

License: Other

JavaScript 1.25% CSS 0.05% Python 24.84% Java 72.66% C++ 1.19%

angry_birds_rl's Introduction

Playing Angry Birds: Deep Deterministic Policy Gradient with Attention-based Long Short-Term Memory

Tensorflow implementation of a sequential decision-making agent for solving "Angry Birds" using Deep Deterministic Policy Gradient (DDPG) with Attention-based Long Short-Term Memory (LSTM) for state encoding. In other words, we explore a model with a deterministic policy using an actor-critic algorithm for learning off policy with a stochastic behavior policy.

Within Angry Birds Basic Game Playing Software environment.


Structure overview of the Angry Bird player: DDPG with Attention-based LSTM State Encoder.

Preliminary

The code collects transitions which consist of (state, action, reward, new state, done) tuples. A raw state is a list of categories for 7 different objects: red bird, yellow bird, blue bird, pig, ice, wood, stone, and each category consists of n elements of objects with coordinates: x, y, width, height, angle. This state goes to the input of attentional LSTM state encoder and then, the output is the vector, size of [batch x n_hidden]. The action includes a measure of how much a band of the sling can extend for two different directions: x and y, and tap time for yellow bird and blue bird considering their characteristics, e.g. action scales [[0, -35], [0, 35], [0,2000]]. The reward is the score that we get from the game agent after performing each action. Lastly, states and rewards are normalized to [0, 1].

Model Description

"Angry Birds" is a game with a continuous action and state space. The policy gradient methods proved their effectiveness in a number of various domains. However, for a stochastic policy, the variance of a gradient estimate goes to infinity as policy becomes more and more deterministic. Therefore deterministic policy, a limiting case of a stochastic policy (proved by D.Silver 2014) where we simply update policy parameters in the direction of gradients of a Q function, was proposed to deal with high-dimensional continuous action spaces.

In order to explore with a deterministic policy, we use an actor-critic algorithm for learning off policy with a stochastic behavior policy. In our case exploration policy is obtained by adding a noise from an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930) to a deterministic action from an actor.

Since the java visual module is able to extract locations of movable objects (each location is of size n_coord = 5), we decided to use this information for a state description over the raw image as it contains already all useful information without the noise or background.


GIU of the Angry Birds.

In this game, we considered 7 main categories of objects: red bird, yellow bird, blue bird, pig, ice, wood, and stone. On each screen there can be a different number of these objects, thus to encode the sequences of different length we selected LSTM. Also in actor-critic architecture, there are two streams of gradients coming from two networks (actor and critic), it was one of the main reasons to use LSTM encoding since in this case the encoded state is fed to actor and critic and thus LSTM could be trained faster due to double error backpropagation.

The input is a list with 7 elements of size [batch x seqlen x n_coord], where seqlen (sequence length) varies for each category. Firstly, each category of object of size [batch x seqlen x n_coord] is transformed linearly with a unique matrix into [batch x seqlen x n_hidden], which is later encoded into a fixed size vector [batch x n_hidden] with a multiple layer LSTM. After that, we have 7 vectors of size [batch x n_hidden] that is encoded with attentional LSTM into a one fixed size vector [batch x n_hidden], which in principle has to catch the importance of each category and to which one pay more attention.

Connection Pipeline


Connection pipeline overview.

Credits

Reimplementation of DDPG based on OpenAI Gym + Tensorflow: @floodsung

Developers of Angry Birds Basic Game Playing Software https://aibirds.org/

angry_birds_rl's People

Contributors

parshakova avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.