Octo is a transformer-based diffusion policy pre-trained on 800,000 robot episodes from the Open X-Embodiment dataset. It is deliberately lightweight and is released in two sizes: Octo-Small (27M parameters) and Octo-Base (93M parameters, comparable in size to a ViT-B). Images pass through a shallow convolutional tokeniser and are split into patches; language is encoded with a T5-Base text encoder. A modular block-wise attention structure lets the model accept different inputs (one or more RGB cameras, wrist cameras, goal images or language instructions) simply by changing which tokens are fed in. Actions are produced by a small conditional diffusion head that predicts continuous, multi-modal action distributions. The transformer backbone runs only once per action prediction; the iterative denoising steps happen inside the small head. Crucially, Octo can be adapted to new sensory inputs (such as force-torque feedback), new action spaces (such as joint-position control) and new robot morphologies by adding adapters and fine-tuning on a small dataset with an accessible compute budget. Out of the box it outperforms RT-1-X and performs comparably to the far larger 55B-parameter RT-2-X on language-conditioned tasks. Limitations: as a compact model it has weaker language grounding than 7B-class VLAs; it targets table-top manipulation; and it is a research project rather than a supported product. Octo is a popular, efficient baseline for labs that cannot run models with 7B or more parameters.
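The block-wise attention idea can be illustrated with a short sketch. This is not Octo's actual code: the `TokenGroup` class, the three group kinds and the specific attention rules below are simplifying assumptions chosen for illustration. The point it shows is that the mask is derived from per-group rules, so adding or removing an input only adds or removes rows and columns without changing how the remaining tokens attend to each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenGroup:
    """Hypothetical description of one block of tokens in the sequence."""
    name: str      # e.g. "language", "obs_primary", "readout"
    kind: str      # "task", "obs" or "readout"
    timestep: int  # -1 for task tokens, which are not tied to a timestep
    length: int    # number of tokens in the group


def may_attend(query: TokenGroup, key: TokenGroup) -> bool:
    """Return True if tokens in `query` may attend to tokens in `key`."""
    if key.kind == "task":
        return True   # every token may read the task tokens
    if query.kind == "task":
        return False  # task tokens only see other task tokens
    if key.kind == "readout":
        # Assumed rule: readouts are passive, so only the readout of the
        # same timestep may read them.
        return query.kind == "readout" and query.timestep == key.timestep
    # The key is an observation group: causal over timesteps.
    return key.timestep <= query.timestep


def build_mask(groups):
    """Expand the group-level rules into a token-level mask (rows are queries)."""
    owner = [g for g in groups for _ in range(g.length)]
    return [[may_attend(q, k) for k in owner] for q in owner]


groups = [
    TokenGroup("language", "task", -1, 2),
    TokenGroup("obs_primary", "obs", 0, 3),
    TokenGroup("readout", "readout", 0, 1),
    TokenGroup("obs_primary", "obs", 1, 3),
    TokenGroup("readout", "readout", 1, 1),
]
mask = build_mask(groups)

# Adding a (hypothetical) wrist camera appends tokens but leaves the
# existing 10x10 block of the mask untouched.
extended = build_mask(groups + [TokenGroup("obs_wrist", "obs", 1, 2)])
```

Because every entry depends only on the two groups involved, a fine-tuning setup could in principle add a new sensor by defining one more group, which is the property the modular design relies on.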
Octo is a small, fully open-source generalist robot policy built as a transformer-based diffusion model. Its modular design lets it be quickly fine-tuned to new robots, sensors and action spaces on a modest compute budget.
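To make the "transformer-based diffusion" split concrete, the following is a minimal DDPM-style sampling sketch, assuming a linear noise schedule and 20 denoising steps (both illustrative choices, not values taken from the text). The `backbone` and `toy_head` functions are hypothetical stand-ins: the backbone simply passes its input through as a readout embedding, and the head is an oracle that computes the exact noise for a known target, so the loop can be checked without any training.

```python
import math
import random

NUM_STEPS = 20  # number of denoising steps (illustrative assumption)


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice) and cumulative alpha products."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


BETAS, ALPHA_BARS = make_schedule(NUM_STEPS)
calls = {"backbone": 0, "head": 0}


def backbone(observation):
    """Stand-in for the expensive transformer; returns a readout embedding."""
    calls["backbone"] += 1
    return list(observation)  # toy: the embedding is the target action itself


def toy_head(x, readout, k):
    """Oracle noise predictor standing in for the small learned diffusion head."""
    calls["head"] += 1
    scale = math.sqrt(ALPHA_BARS[k])
    denom = math.sqrt(1.0 - ALPHA_BARS[k])
    return [(xi - scale * ri) / denom for xi, ri in zip(x, readout)]


def sample_action(readout, rng):
    """DDPM-style reverse process conditioned on a fixed readout embedding."""
    x = [rng.gauss(0.0, 1.0) for _ in readout]  # start from pure noise
    for k in reversed(range(NUM_STEPS)):
        eps = toy_head(x, readout, k)  # cheap head call, repeated every step
        alpha = 1.0 - BETAS[k]
        coef = BETAS[k] / math.sqrt(1.0 - ALPHA_BARS[k])
        x = [(xi - coef * ei) / math.sqrt(alpha) for xi, ei in zip(x, eps)]
        if k > 0:  # no noise is added on the final step
            sigma = math.sqrt(BETAS[k])
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x


readout = backbone([0.3, -0.1, 0.5])  # the backbone runs once
action = sample_action(readout, random.Random(0))
```

After this runs, `calls` records one backbone call and twenty head calls, which mirrors the cost structure described for Octo: the expensive transformer pass is amortised over the whole denoising loop.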
