Askdroid

Google RT-2

RT-2 (Robotic Transformer 2) is the model that defined the vision-language-action category. It fine-tunes large web-trained vision-language models to emit robot actions as text tokens, transferring internet knowledge directly into robotic control.
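As a rough illustration of what "emitting robot actions as text tokens" can mean, the sketch below discretises each continuous action dimension into an integer bin and joins the bins into a string that a language model could be trained to output. It assumes 256 uniform bins per dimension, in line with the scheme described in the RT-2 paper; the normalised range of [-1, 1] and the names `encode_action` and `decode_action` are hypothetical and chosen only for this example.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumption: uniform bins per action dimension


def encode_action(action: Sequence[float], low: float = -1.0, high: float = 1.0) -> str:
    """Turn a continuous action vector into a space-separated string of bin indices."""
    tokens = []
    for value in action:
        # Clip out-of-range values so they land in the first or last bin.
        clipped = min(max(value, low), high)
        # Scale to [0, NUM_BINS - 1] and round to the nearest bin index.
        bin_index = round((clipped - low) / (high - low) * (NUM_BINS - 1))
        tokens.append(str(bin_index))
    return " ".join(tokens)


def decode_action(text: str, low: float = -1.0, high: float = 1.0) -> List[float]:
    """Invert encode_action: map each bin index back to a continuous value."""
    return [low + int(tok) / (NUM_BINS - 1) * (high - low) for tok in text.split()]


if __name__ == "__main__":
    # Hypothetical 3-dimensional action; a real robot action would have more dimensions.
    action = [0.1, -0.5, 0.9]
    text = encode_action(action)
    print(text)                 # the string a VLM would be trained to emit
    print(decode_action(text))  # the approximate continuous action recovered from it
```

Because the action is rounded to a bin, decoding recovers it only up to half a bin width, which is the price of reusing an ordinary text vocabulary for control.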

RT-2, published by Google DeepMind in July 2023, is widely credited as the first model to establish the vision-language-action (VLA) concept. Its core idea is simple and general: express robot actions as text tokens and add them to the training set of a vision-language model (VLM) exactly as if they were language. The VLM is then co-fine-tuned on both internet-scale vision-language tasks (such as visual question answering) and robot trajectory data, so a single network both answers questions about images and outputs actions. Two instantiations were built: RT-2-PaLI-X (up to 55B parameters) and RT-2-PaLM-E (up to 12B).

Across roughly 6,000 evaluation trials, RT-2 showed markedly improved generalisation to novel objects and instructions. It also showed emergent abilities that did not appear in the robot training data: interpreting unseen symbols, rudimentary reasoning such as choosing an improvised hammer, and recognising people and semantic categories.

Limitations: it is a closed research model with no public weights or API; the 55B variant runs at only about 1-3 Hz, which limits dynamic control; and it targets single-arm manipulation. RT-2 is the conceptual ancestor of nearly every model in this directory.
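To make the 1-3 Hz limitation concrete, the short sketch below converts a control rate into the time between consecutive actions and into how far a moving object travels in that time. The function names and the 10 cm/s object speed are hypothetical values chosen for illustration, not figures from the RT-2 evaluation.

```python
def control_period_ms(rate_hz: float) -> float:
    """Time between consecutive actions, in milliseconds."""
    return 1000.0 / rate_hz


def travel_between_actions_cm(speed_cm_per_s: float, rate_hz: float) -> float:
    """Distance an object moving at a constant speed covers between two actions."""
    return speed_cm_per_s / rate_hz


if __name__ == "__main__":
    object_speed = 10.0  # assumption: an object moving at 10 cm/s
    for rate in (1.0, 3.0):
        period = control_period_ms(rate)
        drift = travel_between_actions_cm(object_speed, rate)
        print(f"{rate:.0f} Hz: {period:.0f} ms per action, object moves {drift:.1f} cm")
```

At these rates the robot commits to each action for roughly a third of a second to a full second, which is workable for slow tabletop manipulation but leaves little room to react to fast-moving objects.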

Tags: Embodied AI, Foundation Model, Manipulation, research, vision-language-action
Company Name: Google DeepMind
Category: Vision-Language-Action Models
Country: United States
License: Research-only (closed; no public weights or API)
Stage: Research prototype
Model Size: 55B (also 12B and 5B variants)
Hardware Requirement: Cloud-only (data-centre / TPU-class; not released for local inference)
API: None (research model, not released)
Documentation: Documentation URL
Paper / Publication: Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control" (2023), arXiv:2307.15818
Robots Using: The Google Robot mobile manipulator (the Everyday Robots single-arm platform) used in Google DeepMind's labs. Research evaluations only; never offered as a product.

Copyright © 2026, Askdroid. All Rights Reserved