Unleashing the Power of Pre-trained Language Models
for Offline Reinforcement Learning

Ruizhe Shi1*   Yuyao Liu1*   Yanjie Ze2   Simon S. Du3   Huazhe Xu124
1Tsinghua University, IIIS   2Shanghai Qi Zhi Institute   3University of Washington   4Shanghai AI Lab  

*Equal contribution. Order is decided by coin flip.

Accepted at ICLR 2024.


Can we unleash the power of pre-trained LMs to solve
sequential decision-making problems?

Overview

Offline reinforcement learning (RL) aims to find a near-optimal policy using pre-collected datasets. In real-world scenarios, data collection can be costly and risky; therefore, offline RL becomes particularly challenging when in-domain data is limited. Given recent advances in Large Language Models (LLMs) and their few-shot learning prowess, this paper introduces Language Models for Motion Control (LaMo), a general framework based on Decision Transformers to effectively use pre-trained Language Models (LMs) for offline RL. Our framework highlights four crucial components: (1) initializing Decision Transformers with sequentially pre-trained LMs, (2) employing the LoRA fine-tuning method, in contrast to full-weight fine-tuning, to combine the pre-trained knowledge from LMs with in-domain knowledge effectively, (3) using non-linear MLP transformations instead of linear projections to generate embeddings, and (4) integrating an auxiliary language prediction loss during fine-tuning to stabilize the LMs and retain their original language abilities. Empirical results indicate that LaMo achieves state-of-the-art performance in sparse-reward tasks and closes the gap between value-based offline RL methods and Decision Transformers in dense-reward tasks. In particular, our method demonstrates superior performance in scenarios with limited data samples. Below we show the average performance over different data ratios (Medium datasets for MuJoCo and Atari; Complete and Partial datasets for Kitchen).
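As a concrete illustration of component (2), the snippet below is a minimal sketch of parameter-efficient LoRA fine-tuning applied to a pre-trained GPT-2 backbone, assuming the HuggingFace transformers and peft libraries; the rank, target modules, and other hyperparameters shown are illustrative, not the exact configuration used in the paper.

from transformers import GPT2Model
from peft import LoraConfig, get_peft_model

# Sequentially pre-trained LM used to initialize the Decision Transformer backbone.
backbone = GPT2Model.from_pretrained("gpt2")

# Wrap the backbone with low-rank adapters; only the adapters are trained.
lora_config = LoraConfig(
    r=8,                        # low-rank dimension (illustrative)
    lora_alpha=16,
    target_modules=["c_attn"],  # GPT-2 attention projection layers
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(backbone, lora_config)

# Reports that only a small fraction of the parameters is trainable.
model.print_trainable_parameters()

Keeping the backbone frozen and training only these adapters (plus the new embedding and prediction heads) is what lets the trainable parameters stay at roughly 0.7% of the full model while retaining the pre-trained knowledge.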

Highlights

We propose LaMo, an offline RL framework that leverages pre-trained Language Models (LMs) for low-level Motion control. On sparse-reward tasks, LaMo achieves strong results and surpasses recent strong algorithms CQL, IQL, TD3+BC, and DT; on dense-reward tasks, LaMo significantly improves over Decision Transformer and closes the gap between value-based methods and DT-based methods. Notably, in low-data scenarios, our method demonstrates powerful few-shot learning ability, which can be attributed to the inductive bias from pre-trained LMs.

We look into the relationship between the performance of various algorithms and the scale of data. As depicted in the figure, LaMo is capable of achieving excellent performance even with relatively small datasets. For example, in Hopper, LaMo surpasses the performance of CQL and DT when the data sample ratio is 0.5% and maintains this advantage consistently as the sample ratio increases.

Below, we visualize the 8 tasks across 3 domains that we consider: Hopper, Walker2d, HalfCheetah, and Reacher (MuJoCo); Breakout, Qbert, and Pong (Atari); and Kitchen.

Method

LaMo encompasses four crucial designs (a code sketch follows after the list):

  1. We adopt a pre-trained LM (namely, GPT-2) to initialize a Decision Transformer (DT);
  2. We replace the linear embedding projections with MLPs to augment representation learning for complicated tasks;
  3. While training the offline RL agent, we freeze the pre-trained parts and apply the parameter-efficient fine-tuning technique LoRA, so that trainable parameters account for only 0.7% of the entire model;
  4. We introduce language prediction as an auxiliary objective during fine-tuning, in order to stabilize performance and maintain language ability.
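For concreteness, the following is a minimal PyTorch sketch of how these four designs fit together, assuming a GPT-2 backbone from HuggingFace transformers. The class name LaMoSketch, the MLP widths, the language-loss weight, and the omission of timestep embeddings and attention masks are illustrative simplifications, not the actual LaMo implementation.

import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2LMHeadModel


class LaMoSketch(nn.Module):
    def __init__(self, state_dim, act_dim, hidden=768):
        super().__init__()
        # (1) Initialize the Decision Transformer backbone from pre-trained GPT-2.
        self.transformer = GPT2Model.from_pretrained("gpt2")

        # (2) Non-linear MLP embeddings instead of single linear projections.
        def mlp(in_dim):
            return nn.Sequential(
                nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, hidden)
            )

        self.embed_return = mlp(1)
        self.embed_state = mlp(state_dim)
        self.embed_action = mlp(act_dim)
        self.predict_action = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, act_dim)
        )

        # (4) Language head reused for the auxiliary language prediction loss.
        self.lm_head = GPT2LMHeadModel.from_pretrained("gpt2").lm_head

    def forward(self, returns, states, actions):
        # returns: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim).
        r = self.embed_return(returns)
        s = self.embed_state(states)
        a = self.embed_action(actions)
        B, T, H = s.shape
        # Interleave tokens as (R_1, s_1, a_1, R_2, s_2, a_2, ...).
        tokens = torch.stack((r, s, a), dim=2).reshape(B, 3 * T, H)
        hidden = self.transformer(inputs_embeds=tokens).last_hidden_state
        # Predict each action from the hidden state at its state token.
        return self.predict_action(hidden[:, 1::3])


def training_loss(model, rl_batch, lm_batch, lam=0.1):
    # (3) Only the LoRA adapters (set up via peft, as sketched earlier) and the
    #     newly added heads are trainable; the pre-trained backbone stays frozen.
    pred = model(rl_batch["returns"], rl_batch["states"], rl_batch["actions"])
    rl_loss = ((pred - rl_batch["actions"]) ** 2).mean()

    # (4) Auxiliary next-token prediction on language data, weighted by `lam`
    #     (an illustrative weight), to stabilize training and retain language ability.
    h = model.transformer(input_ids=lm_batch["input_ids"]).last_hidden_state
    logits = model.lm_head(h[:, :-1])
    lang_loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        lm_batch["input_ids"][:, 1:].reshape(-1),
    )
    return rl_loss + lam * lang_loss

In practice the action loss would also mask padded timesteps, but the essential point is that the same GPT-2 weights process both the control sequence and the language sequence, with only the lightweight adapters and new heads being updated.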

Citation

@inproceedings{shi2024LaMo,
  title={Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning},
  author={Ruizhe Shi and Yuyao Liu and Yanjie Ze and Simon Shaolei Du and Huazhe Xu},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=AY6aM13gGF}
}