RT-2 (Robotics Transformer 2) is a Vision-Language-Action (VLA) model published by Google DeepMind in 2023 that pioneered the modern VLA paradigm. It builds on Google’s pretrained vision-language models PaLI-X and PaLM-E (up to 55B parameters) and co-fine-tunes them on robot trajectory data alongside their original web-scale visual question answering and image captioning data. The key insight is to express robot actions as just another sequence of text tokens, so the same transformer that learned to describe a kitchen can also output low-level robot commands such as end-effector displacements. Across 6,000 evaluation trials, RT-2 showed significant improvements over RT-1 in generalising to novel objects, interpreting commands not present in the robot training data (such as ‘place the banana on the number 2’), and performing rudimentary semantic reasoning (‘pick up the smallest object’). With chain-of-thought prompting, it can carry out multi-step semantic plans: for example, identifying a rock as an improvised hammer, or choosing an energy drink for a tired person. RT-2 itself was not publicly released, but its design directly inspired open-source successors such as OpenVLA and π0, as well as OpenAI’s robotics work, making it the architectural template for the current VLA generation.
Vision-language-action model from Google DeepMind that co-fine-tunes large web-scale VLMs (PaLI-X, PaLM-E) with robot trajectory data. Robot actions are encoded as text tokens, allowing the model to inherit chain-of-thought reasoning and generalise to novel objects and instructions that it has seen only in web data, never in robot demonstrations.
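The idea of encoding robot actions as text tokens can be illustrated with a minimal sketch. It assumes each continuous action dimension is clipped to a fixed range and discretised into 256 uniform bins, and that the action string starts with a terminate flag. The names `encode_action` and `decode_action`, the [-1, 1] range, and the example values are hypothetical choices for illustration, not RT-2's actual code.

```python
from typing import List, Sequence, Tuple

NUM_BINS = 256  # assumption: uniform bins per action dimension


def encode_action(action: Sequence[float], terminate: bool = False,
                  low: float = -1.0, high: float = 1.0) -> str:
    """Map a continuous action vector to a space-separated string of integers."""
    tokens = ["1" if terminate else "0"]  # leading terminate flag (assumed layout)
    for value in action:
        clipped = min(max(value, low), high)  # keep the value inside [low, high]
        bin_index = round((clipped - low) / (high - low) * (NUM_BINS - 1))
        tokens.append(str(bin_index))
    return " ".join(tokens)


def decode_action(text: str, low: float = -1.0,
                  high: float = 1.0) -> Tuple[bool, List[float]]:
    """Invert encode_action, up to quantisation error of half a bin width."""
    parts = text.split()
    terminate = parts[0] == "1"
    values = [low + int(p) / (NUM_BINS - 1) * (high - low) for p in parts[1:]]
    return terminate, values


# Hypothetical 7-dimensional action: position delta, rotation delta, gripper.
example = [0.10, -0.25, 0.03, 0.0, 0.5, -0.9, 1.0]
action_text = encode_action(example)
flag, recovered = decode_action(action_text)
```

Because the output is an ordinary string of integers, a language model can in principle emit it with its existing vocabulary, and a controller can decode it back into approximate continuous commands.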
