|
|
|
|
Reinforcement learning presents an attractive paradigm to reason about several distinct aspects of sequential decision making, such as specifying complex goals, planning future observations and actions, and critiquing their utilities. However, the combined integration of these capabilities poses competing algorithmic challenges in retaining maximal expressivity while allowing for flexibility in modeling choices for efficient learning and inference. We present Decision Stacks, a generative framework that decomposes goal-conditioned policy agents into three generative modules. These modules simulate the temporal evolution of observations, rewards, and actions via independent generative models that can be learned in parallel via teacher forcing. Our framework guarantees both expressivity and flexibility in designing individual modules to account for key factors such as architectural bias, optimization objective and dynamics, transferability across domains, and inference speed. Our empirical results demonstrate the effectiveness of Decision Stacks for offline policy optimization in several MDP and POMDP environments, outperforming existing methods and enabling flexible generative decision making.
|
|
Illustration of the Decision Stacks framework for learning reinforcement learning agents via probabilistic inference. In contrast to a time-induced ordering, we propose a modular design that segregates the modeling of observation, reward, and action sequences. Each module can be flexibly parameterized via any generative model, and the modules are chained via an autoregressive dependency graph to provide high overall expressivity.
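To make the dependency graph concrete, here is a minimal Python sketch (hypothetical interfaces, not the authors' released code) of how three independently parameterized modules could be chained so that rewards condition on sampled observations and actions condition on both.

```python
# Minimal sketch (hypothetical interfaces) of the Decision Stacks factorization:
# observations, rewards, and actions are generated by three separate modules
# chained through an autoregressive dependency graph.
from dataclasses import dataclass
from typing import Protocol
import torch


class GenerativeModule(Protocol):
    def sample(self, *conditioning: torch.Tensor) -> torch.Tensor:
        """Draw a trajectory-level sample conditioned on upstream modules."""
        ...


@dataclass
class DecisionStack:
    obs_model: GenerativeModule      # p(o_1:T | goal), e.g. a diffusion model
    reward_model: GenerativeModule   # p(r_1:T | o_1:T)
    action_model: GenerativeModule   # p(a_1:T | o_1:T, r_1:T)

    def sample_trajectory(self, goal: torch.Tensor):
        # Autoregressive chaining across modules:
        # observations -> rewards -> actions.
        observations = self.obs_model.sample(goal)
        rewards = self.reward_model.sample(observations)
        actions = self.action_model.sample(observations, rewards)
        return observations, rewards, actions
```

Because each module only conditions on the outputs of the modules before it, the three models can be trained independently with teacher forcing and swapped out without retraining the others.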
Offline Reinforcement Learning Performance in POMDPs. We generate the POMDP datasets from the D4RL locomotion datasets. Decision Stacks (DS) consistently achieves competitive or superior results compared to the other algorithms, including BC, DT, TT, and DD. Notably, DS outperforms the other methods in most environments and attains the highest average score of 74.3, a 15.7% improvement over the next best-performing approach, Diffuser. This highlights the effectiveness of our approach in handling POMDP tasks by modeling the dependencies among observations, actions, and rewards more expressively.
Offline Reinforcement Learning Performance in MDPs. Decision Stacks outperforms or is competitive with the other baselines on 6 of 9 environments and achieves among the highest aggregate scores. These results suggest that even in environments where the MDP framework permits appropriate conditional independence assumptions, the expressivity of the individual modules in Decision Stacks aids test-time generalization.
We test the planning capabilities of Decision Stacks on the Maze2D task from the D4RL benchmark. This is a challenging environment that requires an agent to generate a plan from a start location to a goal location. The demonstrations contain a sparse reward signal of +1 only when the agent comes close to the goal. Our experiments demonstrate that Decision Stacks generates robust trajectory plans and matching action sequences, outperforming baselines by significant margins through enhanced modularity and flexible modeling.
Example rollouts on the Maze2D-medium-v1 environment, where the start positions are consistent across each column and the goal position is located at the bottom right corner of the maze. The trajectory waypoints are color-coded, transitioning from blue to red as time advances. The bottom two rows show that Diffuser, DD, and DS can all generate good plans that can be executed well with a hand-coded controller; however, their respective action models result in differing executions. Compared to DD and Diffuser, DS generates more flexible and reliable trajectories that align closely with the future waypoints planned by the observation model towards the goal.
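For a concrete picture of how such plans could be turned into executed trajectories, the sketch below shows a simple closed-loop rollout. It reuses the hypothetical `DecisionStack` interface from the earlier sketch and assumes the classic Gym step API; it is an illustrative loop, not the authors' controller.

```python
# Minimal sketch (hypothetical interfaces) of closed-loop execution on a
# goal-reaching task such as Maze2D: the observation module plans waypoints
# toward the goal, the action module produces actions consistent with that
# plan, and only the first action is executed before replanning (MPC-style).
def rollout(env, stack, goal, max_steps=600):
    obs = env.reset()
    total_return = 0.0
    for _ in range(max_steps):
        # Plan a window of future observations conditioned on the current
        # observation and the goal, then infer matching rewards and actions.
        plan = stack.obs_model.sample(obs, goal)
        rewards = stack.reward_model.sample(plan)
        actions = stack.action_model.sample(plan, rewards)
        obs, reward, done, _ = env.step(actions[0])  # execute first action only
        total_return += reward
        if done:
            break
    return total_return
```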
Decision Stacks distinctly separates the prediction of observations, rewards, and actions, employing three distinct models that can be trained independently using teacher forcing. We explore the additional flexibility offered by different architecture choices for each module. We display a combination of 2x3x3 policy agents for the Hopper-medium-v2 POMDP environment. Since we adopt a modular structure, we can compose the different modules efficiently and hence only needed to train 2 (state) + 3 (reward) + 3 (action) models.
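This mix-and-match evaluation can be pictured with a short sketch. The module lists and counts below are illustrative placeholders (assumed to be trained elsewhere), reusing the hypothetical `DecisionStack` container from the earlier sketch.

```python
# Minimal sketch (hypothetical, assuming the modules are already trained):
# 2 observation, 3 reward, and 3 action modules trained independently can be
# composed into 2 x 3 x 3 = 18 agents from only 2 + 3 + 3 = 8 trained models.
from itertools import product


def compose_agents(obs_models, reward_models, action_models):
    """Cross-product of independently trained modules into full agents."""
    return [
        DecisionStack(obs_model=o, reward_model=r, action_model=a)
        for o, r, a in product(obs_models, reward_models, action_models)
    ]

# e.g. len(compose_agents(obs_models, reward_models, action_models)) == 18
# when there are 2 observation, 3 reward, and 3 action modules.
```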