RT-2 (Robotic Transformer 2) is the model that defined the vision-language-action (VLA) category. It fine-tunes large web-trained vision-language models to emit robot actions as text tokens, transferring internet knowledge directly into robotic control.
RT-2, published by Google DeepMind in July 2023, is widely credited as the first model to establish the VLA concept. Its core idea is simple and general: express robot actions as text tokens and add them to the training set of a vision-language model (VLM) exactly as if they were language. The VLM can then be co-fine-tuned on both internet-scale vision-language tasks (such as visual question answering) and robot trajectory data, so a single network both answers questions about images and outputs actions. Two instantiations were built: RT-2-PaLI-X (up to 55B parameters) and RT-2-PaLM-E (up to 12B). Across roughly 6,000 evaluation trials, RT-2 showed markedly improved generalisation to novel objects and instructions, plus emergent abilities that did not appear in the robot training data: interpreting unseen symbols, doing rudimentary reasoning such as choosing an improvised hammer, and recognising people and semantic categories. It has three main limitations: it is a closed research model with no public weights or API; the 55B variant runs at only about 1-3 Hz, which limits dynamic control; and it targets single-arm manipulation. RT-2 is the conceptual ancestor of nearly every model in this directory.
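To make the "actions as text tokens" idea concrete, the following is a minimal sketch of how a continuous action vector could be discretised into integer bins and written as a string that a language model can emit, then decoded back into a command. It is an illustration, not RT-2's actual code (which is not public). The 256-bin uniform discretisation, the per-dimension ranges, and the names `encode_action` and `decode_action` are all assumptions made for this example.

```python
from typing import List, Sequence

# Assumption: each action dimension is split into 256 uniform bins.
NUM_BINS = 256


def encode_action(action: Sequence[float],
                  low: Sequence[float],
                  high: Sequence[float]) -> str:
    """Map each continuous action dimension to a bin index and join as text."""
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)           # keep value inside its range
        frac = (clipped - lo) / (hi - lo)           # normalise to [0, 1]
        bin_index = min(int(frac * NUM_BINS), NUM_BINS - 1)
        tokens.append(str(bin_index))
    return " ".join(tokens)


def decode_action(text: str,
                  low: Sequence[float],
                  high: Sequence[float]) -> List[float]:
    """Invert encode_action, returning the centre of each bin."""
    bins = [int(token) for token in text.split()]
    return [lo + (b + 0.5) / NUM_BINS * (hi - lo)
            for b, lo, hi in zip(bins, low, high)]


# Hypothetical 3-dimensional action with every dimension in [-1, 1].
LOW = [-1.0, -1.0, -1.0]
HIGH = [1.0, 1.0, 1.0]
action_text = encode_action([0.0, -1.0, 1.0], LOW, HIGH)
recovered = decode_action(action_text, LOW, HIGH)
```

Because the encoded action is just a short string of integers, it can sit in the same training corpus as captions and question-answer pairs, which is what makes the co-fine-tuning described above possible. The cost is a quantisation error of at most half a bin width per dimension.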
