Reinforcement-Learning-Papers-at-ICML

35th Edition, 2018

Introduction

  • The 35th International Conference on Machine Learning (ICML 2018) was held in Stockholm, Sweden, from Tuesday, July 10 to Sunday, July 15, 2018. ICML is one of the most prestigious AI-related academic conferences and a highly coveted venue for publishing research papers.
  • 621 papers were accepted out of 2,473 submissions for a 25.1% acceptance rate. That’s 47% more submissions than last year’s 1,676 (with a similar 25% acceptance rate).
  • RL sessions were in the biggest room and had the most papers.
  • This is your one-stop shop for everything RL at ICML 2018.

RL Categories

I categorize all accepted RL papers into the following topics:

  • RL Theory
  • RL Network Architecture
  • RL Algorithms
  • RL Optimization
  • RL Exploration
  • RL Reward
  • Model-based RL
  • Distributed and Distributional RL
  • Hierarchical RL
  • Multi-Agent
  • Meta-Learning, Transfer, Lifelong Learning
  • Applications and Visualization

RL Theory:

  • "Problem Dependent Reinforcement Learning Bounds Which Can Identify Bandit Structure in MDPs" -> non standard transition model, learn to convert MPDs to MACs.
  • "Learning with Abandonment" -> non standard transition model, a platform that wants to learn a personalized policy for each user, but the platform faces the risk of a user abandoning the platform if dissatisfied with the actions of the platform.
  • "Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator" -> LQR proof
  • "More Robust Doubly Robust Off-policy Evaluation" -> estimate the performance of a policy from the data generated by another policy.
  • "Best Arm Identification in Linear Bandits with Linear Dimension Dependency" -> exploiting the global linear structure to improve the estimate of the reward of near-optimal arms.
  • "Convergent Tree Backup and Retrace with Function Approximation" -> stable and efficient gradient-based algorithms using a quadratic convex-concave saddle-point formulation
  • "Time Limits in Reinforcement Learning" -> formal account for how time limits could effectively be handled in cases and explain why not doing so can cause state-aliasing and invalidation of experience replay, leading to suboptimal policies and training instability. For fixed period, terminations due to time limits are in fact part of the environment, and thus a notion of the remaining time should be included as part of the agent’s input to avoid violation of the Markov property.
  • "Configurable Markov Decision Processes" -> In many real-world problems, there is the possibility to configure, to a limited extent, some environmental parameters to improve the performance of a learning agent. a new learning algorithm, Safe Policy-Model Iteration (SPMI), to jointly and adaptively optimize the policy and the environment configuration.

RL Network Architecture:

  • "Structured Control Nets for Deep Reinforcement Learning" -> proposed Structured Control Net splits the generic MLP into two separate sub-modules: a nonlinear control module and a linear control module. The nonlinear control is for forward-looking and global control, while the linear control stabilizes the local dynamics around the residual of global control.
  • "Gated Path Planning Networks" -> reframe VINs as recurrent-convolutional networks which demonstrates that VINs couple recurrent convolutions with an unconventional max-pooling activation. Gated recurrent update equations could potentially alleviate the optimization issues plaguing VIN.
  • "Universal Planning Networks: Learning Generalizable Representations for Visuomotor Control" -> This planning computation unrolls a forward model in a latent space and infers an optimal action plan through gradient descent trajectory optimization, optimizing a supervised imitation learning objective. The representations learned also provide a metric for specifying goals using images, when solving new tasks described via image-based goals.

RL Algorithms:

  • "SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation" -> reformulate Bellman equation into a novel primal-dual optimization problem using Nesterov’s smoothing technique and the Legendre-Fenchel transformation, develop a new algorithm, called Smoothed Bellman Error Embedding, to solve this optimization problem where any differentiable function class may be used.
  • "Scalable Bilinear Pi Learning Using State and Action Features" -> for large-scale Markov decision processes (MDP), we study a primal-dual formulation of the Approximate linear programming, and develop a scalable, model-free algorithm called bilinear pi learning for reinforcement learning when a sampling oracle is provided.
  • "Beyond the One-Step Greedy Approach in Reinforcement Learning" -> analyze the case of multiple-step lookahead policy improvement; formulate variants of multiple-step policy improvement, derive new algorithms using these definitions and prove their convergence.
  • "Importance Weighted Transfer of Samples in Reinforcement Learning" -> the transfer of experience samples collected from a set of source tasks to improve the learning process in a given target task. Proposed a model-based technique that automatically estimates the relevance (importance weight) of each source sample for solving the target task.
  • "Addressing Function Approximation Error in Actor- Critic Methods" -> algorithm builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation; delaying policy updates to reduce per-update error.
  • "Policy Optimization with Demonstrations" -> leverage available demonstrations to guide exploration through enforcing occupancy measure matching between the learned policy and current demonstrations for implicit dynamic reward shaping.

RL Optimization:

  • "Policy Optimization as Wasserstein Gradient Flows" -> on the probability-measure space, policy optimization becomes convex in terms of distribution optimization, interpreted as Wasserstein gradient flows
  • "Clipped Action Policy Gradient" -> exploits the knowledge of actions being clipped to reduce the variance in estimation.
  • "Fourier Policy Gradients" -> recasts integrals that arise with expected policy gradients as convolutions and turns them into multiplications.
  • "Structured Evolution with Compact Architectures for Scalable Policy Optimization" -> blackbox optimization via gradient approximation with the use of structured random orthogonal matrices, providing more accurate estimators than baselines and with provable theoretical guarantees.
  • "Stochastic Variance-Reduced Policy Gradient" -> leverages on importance weights to preserve the unbiased- ness of the gradient estimate.
  • "The Mirage of Action-Dependent Baselines in Reinforcement Learning" -> decompose the variance of the policy gradient estimator and numerically show that learned state-action-dependent baselines do not in fact reduce variance over a state-dependent baseline in commonly tested benchmark domains.
  • "Smoothed Action Value Functions for Learning Gaussian Policies" -> a new notion of action value defined by a Gaussian smoothed version of the expected Q-value. the gradients of expected reward with respect to the mean and covariance of a parameterized Gaussian policy can be recovered from the gradient and Hessian of the smoothed Q-value function. The ability to learn both a mean and covariance during training leads to significantly improved results on standard continuous control benchmarks.
  • "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" -> propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework; the actor aims to maximize expected reward while also maximizing entropy — succeed at the task while acting as randomly as possible.

RL Exploration:

  • "Self-Imitation Learning" -> exploiting past good experiences can indirectly drive deep exploration.
  • "Coordinated Exploration in Concurrent Reinforcement Learning" -> team of reinforcement learning agents that concurrently learn to operate in a common environment using seed sampling, with three properties — adaptivity, commitment, and diversity — necessary for efficient coordinated exploration.
  • "GEP-PG: Decoupling Exploration and Exploitation in Deep Reinforcement Learning Algorithms" -> sequentially combining a Goal Exploration Process and DDPG. Two-phase approach: a first exploration phase discovers a repertoire of simple policies maximizing behavioral diversity, ignoring the reward function; then more standard deep RL phase for fine-tuning, where DDPG uses a replay buffer filled with samples generated by GEP.
  • "Learning to Explore via Meta-Policy Gradient" -> meta-policy gradient algorithm that allows us to adaptively learn the exploration policy in DDPG. Train flexible exploration behaviors that are independent of the actor policy, yielding a global exploration that significantly speeds up the learning process.

RL Reward:

  • "Learning by Playing — Solving Sparse Reward Tasks from Scratch" -> Scheduled Auxiliary Control (SAC-X), the agent is equipped with a set of general auxiliary tasks, that it attempts to learn simultaneously via off-policy RL. Active (learned) scheduling and execution of auxiliary policies allows the agent to efficiently explore its environment — enabling it to excel at sparse reward RL.
  • "Automatic Goal Generation for Reinforcement Learning Agents" -> use a generative model (GANs in this case) to learn to generate desirable “goals” (subsets of the state space) and use that instead of uniform sampling for goals. Solve multi-task problems with an automatic curriculum generation algorithm based on a generative model that tracks the learning agent’s performance.
  • "Learning the Reward Function for a Misspecified Model" -> This paper presents a novel error bound that accounts for the reward model’s behavior in states sampled from the model. This bound is used to extend the existing Hallucinated DAgger-MC algorithm, which offers theoretical performance guarantees in deterministic MDPs that do not assume a perfect model can be learned.
  • "Mix & Match — Agent Curricula for Reinforcement Learning" -> a procedure to automatically form a curriculum over agents; progressively train more complex agents by, effectively, bootstrapping from solutions found by simpler agents; an ensemble or a mixture of experts agent.

Model-based RL:

  • "Lipschitz Continuity in Model-based Reinforcement Learning" -> provide a novel bound on multi-step prediction error of Lipschitz models where we quantify the error using the Wasserstein metric.
  • "Programmatically Interpretable Reinforcement Learning" -> to generate interpretable and verifiable agent policies, Programmatically Interpretable Reinforcement Learning represents policies using a high-level, domain-specific programming language. Neurally Directed Program Search works by first learning a neural policy network using DRL, and then performing a local search over programmatic policies that seeks to minimize a distance from this neural “oracle”.
  • "Feedback-Based Tree Search for Reinforcement Learning" -> propose a model-based reinforcement learning (RL) technique that iteratively applies MCTS on batches of small, finite-horizon versions of the original infinite-horizon Markov decision process. The recommendations generated by the MCTS procedure are then provided as feed- back in order to refine, through classification and regression, the leaf-node evaluator for the next iteration. Competitive agent for multi-player online battle arena (MOBA) game King of Glory.
  • "Machine Theory of Mind" -> Theory of mind (ToM) broadly refers to humans’ ability to represent the mental states of others, including their desires, beliefs, and intentions. ToMnet uses meta-learning to learn a strong prior model for agents’ future behaviour, and, using only a small number of behavioural observations, can bootstrap to richer predictions about agents’ characteristics and mental states.
  • "Measuring abstract reasoning in neural networks" -> propose a dataset and challenge designed to probe abstract reasoning, inspired by a well-known human IQ test. To succeed at this challenge, models must cope with various generalisation ‘regimes’ in which the training and test data differ in clearly-defined ways. propose Wild Relation Network (WReN) applied a Relation Network module (Santoro et al., 2017) multiple times to infer the inter-panel relationships.

Distributed and Distributional RL:

  • "Implicit Quantile Networks for Distributional Reinforcement Learning" -> using quantile regression to approximate the full quantile function for the state-action return distribution for risk-sensitive policies; demonstrate improved performance on the 57 Atari 2600 games.
  • "RLlib: Abstractions for Distributed Reinforcement Learning" -> a library in open source Ray project that provides scalable software primitives for RL, which argues for distributing RL components in a composable way by adapting algorithms for top-down hierarchical control, thereby encapsulating parallelism and resource requirements within short-running compute tasks.
  • "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" -> IMPALA (Importance Weighted Actor-Learner Architecture) scales to thousands of machines without sacrificing data efficiency or resource utilisation; stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace. Tested on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari-57 (all available Atari games in Arcade Learning Environment (Bellemare et al., 2013a)).

Hierarchical RL:

  • "Latent Space Policies for Hierarchical Reinforcement Learning" -> construct hierarchical representations in bottom-up layerwise fashion; each layer is trained to solve the task via a maximum entropy objective. Each layer is also augmented with latent random variables, sampled from a prior distribution during the training of that layer. The maximum entropy objective causes these latent variables to be incorporated into the layer’s policy, and the higher level layer can directly control the behavior of the lower layer through this latent space.
  • "Self-Consistent Trajectory Autoencoder: Hierarchical Reinforcement Learning with Trajectory Embeddings" -> the problem of learning lower layers in a hierarchy is transformed into the problem of learning trajectory-level generative models. learn continuous latent representations of trajectories, which are effective in solving temporally extended and multi-stage problems. his model provides a built-in prediction mechanism, by predicting the outcome of closed loop policy behavior.
  • "An Inference-Based Policy Gradient Method for Learning Options" -> To automatic learn policies with options, the proposed algorithm uses inference methods to simultaneously improve all of the options available to an agent, and thus can be employed in an off-policy manner, without observing option labels. The differentiable inference procedure employed yields options that can be easily interpreted.
  • "Hierarchical Imitation and Reinforcement Learning" -> hierarchical guidance leverages the hierarchical structure of the underlying problem to integrate different modes of expert interaction. Tested on Montezuma’s Revenge.
  • "Using Reward Machines for High-Level Task Specification and Decomposition in Reinforcement Learning" -> Reward Machines is a type of finite state machine that supports the specification of reward functions while exposing reward function structure to the learner and supporting decomposition. Propose Q-Learning for Reward Machines (QRM), an algorithm which appropriately decomposes the reward machine and uses off-policy q-learning to simultaneously learn subpolicies for the different components.

Multi-Agent:

  • "Modeling Others using Oneself in Multi-Agent Reinforcement Learning" -> setting: multi-agent reinforcement learning with imperfect information in which each agent is trying to maximize its own utility. The reward function depends on the hidden state (or goal) of both agents, so the agents must infer the other players’ hidden goals from their observed behavior in order to solve the tasks. The agent uses its own policy to predict the other agent’s actions and update its belief of their hidden state in an online manner.
  • "QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning" -> QMIX, a novel value-based method that can train decentralised policies in a centralised end-to-end fashion. QMIX employs a network that estimates joint action-values as a complex non-linear combination of per-agent values that condition only on local observations. Tested on StarCraft II micromanagement tasks.
  • "Learning to Coordinate with Coordination Graphs in Repeated Single-Stage Multi-Agent Decision Problems" -> exploiting loose couplings, i.e., conditional independences between agents. The expected rewards can be expressed as a coordination graph.
  • "Learning to Act in Decentralized Partially Observable MDPs" -> first near-optimal cooperative multi-agent, by replacing the greedy maximization by mixed-integer linear programming. Experiments in many finite domains from the literature.
  • "Learning Policy Representations in Multiagent Systems" -> casts agent modeling as a representation learning problem; construct a novel objective inspired by imitation learning and agent identification and design an algorithm for unsupervised learning of representations of agent policies.
  • "Competitive Multi-agent Inverse Reinforcement Learning with Sub-optimal Demonstrations" -> inverse reinforcement learning in zero-sum stochastic games when expert demonstrations are known to be suboptimal; introduce a new objective function that directly pits experts against Nash Equilibrium policies, to solve for the reward function in the context of inverse reinforcement learning with deep neural networks as model approximations.

Meta-Learning, Transfer, Lifelong Learning:

  • "Been There, Done That: Meta-Learning with Episodic Recall" -> propose a formalism for generating open-ended yet repetitious environments, then develop a meta-learning architecture for solving these environments. This architecture melds the standard LSTM working memory with a differentiable neural episodic memory.
  • "Transfer in deep reinforcement learning using successor features and generalised policy improvement" -> use Generalized Policy Improvement and Successor Features for transfer skills. extend the SF&GPI framework in two ways. use the reward functions themselves as features for future tasks, without any loss of expressiveness, thus removing the need to specify a set of features beforehand.
  • "Policy and Value Transfer in Lifelong Reinforcement Learning" -> use prior experience to bootstrap lifelong learning in a series of task instances drawn from some task distribution. For value-function-based transfer, value-function initialization methods that preserve PAC guarantees while simultaneously minimizing the learning required in two learning algorithms, yielding MaxQInit.
  • "State Abstractions for Lifelong Reinforcement Learning" -> In lifelong reinforcement learning, agents must effectively transfer knowledge across tasks while simultaneously addressing exploration, credit assignment, and general. State abstraction compresses the representation used by an agent, thereby reducing the computational and statistical burdens of learning. Propose new classes of abstractions: (1) transitive state abstractions, whose optimal form can be computed efficiently, and (2) PAC state abstractions, which are guaranteed to hold with respect to a distribution of tasks.
  • "Continual Reinforcement Learning with Complex Synapses" -> By equipping tabular and deep reinforcement learning agents with a synaptic model that incorporates this biological complexity (Benna & Fusi, 2016), catastrophic forgetting can be mitigated at multiple timescales. Consolidation process is agnostic to timescale of changes in data distribution.

Applications and Visualization:

  • "Spotlight: Optimizing Device Placement for Training Deep Neural Networks" -> using a multi-stage Markov decision process to model the device placement problem.
  • "End-to-end Active Object Tracking via Reinforcement Learning" -> ConvNet-LSTM function approximator is adopted for the direct frame-to-action prediction. Need to augment the environment with reward function.
  • "Deep Reinforcement Learning in Continuous Action Spaces: a Case Study in the Game of Simulated Curling" -> learn game strategies based on a kernel-based Monte Carlo tree search that finds actions within a continuous space. To avoid hand-crafted features, we train our network using supervised learning followed by reinforcement learning with a high-fidelity simulator for the Olympic sport of curling; won an international digital curling competition.
  • "Can Deep Reinforcement Learning Solve Erdos- Selfridge-Spencer Games?" -> introduces an interesting family of two-player zero-sum games with tunable complexity, called Erdos-Selfridge-Spencer games, as a new domain for RL. The authors report on extensive empirical results using a wide variety of training methods, including supervised learning and several flavors of RL (PPO, A2C, DQN) as well as single-agent vs. multi-agent training.
  • "Investigating Human Priors for Playing Video Games" -> investigate the various kinds of prior knowledge that help human learning and find that general priors about objects play the most critical role in guiding human gameplay.
  • "Visualizing and Understanding Atari Agents" -> introduce a method for generating useful saliency maps and use it to show 1) what strong agents attend to, 2) whether agents are making decisions for the right or wrong reasons, and 3) how agents evolve during learning.

Conclusion

  • In addition to training algorithms, model learning, credit assignment, hierarchical RL, meta-learning, and network architectures are popular sub-directions for RL.
  • There is a large, less-explored space for network architectures in RL, considering the number of papers on network architectures for vision problems. Only a few appear among the accepted papers, e.g. "Structured Control Nets for Deep Reinforcement Learning" and "Gated Path Planning Networks".
  • Fairness and interpretability of ML were a big theme. There should be more work on interpretation and analysis for RL as well; one good direction is to use control theory. Relatedly, Ben Recht’s Optimization for Control tutorial was an excellent one; its main idea was that there should be more cross-over between RL and control theory. One excellent example in the accepted papers is "Structured Control Nets for Deep Reinforcement Learning".

Contributing

Please feel free to create a Pull Request to add reinforcement learning papers at ICML.

Support

If you found this useful, please consider starring(★) the repo so that it can reach a broader audience.
