OpenVLA is a 7-billion-parameter open-source Vision-Language-Action (VLA) model released in 2024 by a Stanford / UC Berkeley / Toyota Research Institute / Google DeepMind collaboration. Trained on 970,000 robot demonstrations from the Open X-Embodiment dataset, it pairs a Llama 2 language model backbone with a dual visual encoder that fuses DINOv2 spatial features with SigLIP semantic features. Given a camera image and a natural-language instruction, the model predicts a 7-dimensional robot control action, emitted as a sequence of discrete tokens. OpenVLA was built to address the closed nature of prior VLA models such as RT-2 and to establish best practices for fine-tuning on commodity GPUs. Despite having roughly seven times fewer parameters, it reportedly outperforms RT-2-X (55B) by about 16.5 percentage points in absolute task success rate across 29 manipulation tasks and multiple embodiments. The codebase, fine-tuning notebooks, and model weights are openly released, making OpenVLA a common reference point for teams building a generalist robot policy. Typical use cases include tabletop manipulation on Franka, WidowX, and other research arms, with community support for LoRA-style fine-tuning to new platforms.
Open-source 7B-parameter vision-language-action model trained on 970,000 real-world robot demonstrations. Built on a Llama 2 backbone with a fused DINOv2 + SigLIP vision encoder. Generalises across multiple robot embodiments and supports parameter-efficient fine-tuning for new robots.
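The description above says the model predicts 7-dimensional actions as discrete tokens. The sketch below illustrates one way such a discretisation can work: each continuous action dimension is clipped to a range, mapped to an integer bin, and later decoded back to the bin centre. It is an illustrative sketch, not the project's implementation. The bin count of 256, the action layout (three translation deltas, three rotation deltas, one gripper value), the bounds of -1 to 1, and the function names are all assumptions that do not come from the text above.

```python
from typing import List, Sequence

# Assumption: 256 bins per action dimension (not stated in the description above).
N_BINS = 256


def discretize_action(
    action: Sequence[float],
    low: Sequence[float],
    high: Sequence[float],
    n_bins: int = N_BINS,
) -> List[int]:
    """Map each continuous action dimension to a bin index in [0, n_bins - 1]."""
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)      # keep the value inside its range
        fraction = (clipped - lo) / (hi - lo)  # normalise to [0, 1]
        # The upper bound would land in bin n_bins, so clamp it to the last bin.
        tokens.append(min(int(fraction * n_bins), n_bins - 1))
    return tokens


def undiscretize_action(
    tokens: Sequence[int],
    low: Sequence[float],
    high: Sequence[float],
    n_bins: int = N_BINS,
) -> List[float]:
    """Decode bin indices back to continuous values (the centre of each bin)."""
    return [
        lo + (token + 0.5) * (hi - lo) / n_bins
        for token, lo, hi in zip(tokens, low, high)
    ]


# Hypothetical 7-D action: [dx, dy, dz, droll, dpitch, dyaw, gripper].
LOW = [-1.0] * 7   # hypothetical per-dimension lower bounds
HIGH = [1.0] * 7   # hypothetical per-dimension upper bounds

example_action = [0.0, 0.5, -1.0, 1.0, 0.25, -0.25, 1.0]
example_tokens = discretize_action(example_action, LOW, HIGH)
example_decoded = undiscretize_action(example_tokens, LOW, HIGH)

if __name__ == "__main__":
    print("tokens: ", example_tokens)
    print("decoded:", example_decoded)
```

With uniform bins the round-trip error per dimension is at most half a bin width, so the resolution of the decoded action is set by the bin count and by how tightly the bounds fit the data. In practice the bounds would be derived from statistics of the training actions rather than fixed at -1 and 1.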