1X's NEO Robot Learns by Watching Videos, Not Tedious Training

The robotics industry has a dirty little secret: teaching robots to do anything useful is agonizingly slow and colossally expensive. For years, the prevailing wisdom has been to brute-force intelligence with Vision-Language-Action models (VLAs), which demand tens of thousands of hours of humans meticulously puppeteering robots through every conceivable task. It’s a data bottleneck of epic proportions.

Now, robotics firm 1X is proposing a solution that borders on heresy. Their new approach for the NEO humanoid is deceptively simple: stop the painstaking lessons and just let the robot learn by watching the vast, chaotic, and endlessly instructive library of human behavior we call the internet. This isn’t just an upgrade; it’s a fundamental shift in how a robot can acquire skills.

The Data-Hungry Beast of Yesterday

To appreciate the jump 1X is making, you have to understand the status quo. Most modern foundation models for robotics, from Figure’s Helix to Nvidia’s GR00T, are VLAs. These models are powerful, but they are insatiably hungry for high-quality, robot-specific demonstration data. This means paying people to tele-operate robots for thousands of hours to collect examples of, say, picking up a cup or folding a towel.

This approach is a major impediment to creating truly general-purpose robots. It’s expensive, it doesn’t scale well, and the resulting models can be brittle, failing when faced with an object or environment they haven’t seen before. It’s like trying to teach a child to cook by only letting them watch you in your own kitchen, instead of letting them binge-watch every cooking show ever made.

[Image: A screenshot of the webpage for π*0.6, an example of a Vision-Language-Action model that learns from experience.]

Dream a Little Dream of… Doing Chores

The 1X World Model (1XWM) chucks that playbook out the window. Instead of directly mapping language to actions, it uses text-conditioned video generation to figure out what to do. It’s a two-part brain that effectively allows the robot to imagine the future before it acts.

First, there’s the World Model (WM), a 14-billion-parameter generative video model that acts as the system’s imagination. You give NEO a text prompt—“pack this orange into the lunchbox”—and the WM, looking at the current scene, dreams up a short, plausible video of the task being completed.

Then, the Inverse Dynamics Model (IDM), the pragmatist in the machine, analyzes that dream. It translates the generated pixels into a concrete sequence of motor commands, bridging the gap between a visual what and a physical how. This process is grounded through a multi-stage training strategy: the model starts with web-scale video, is mid-trained on 900 hours of egocentric human video to get a first-person perspective, and finally fine-tuned on a mere 70 hours of NEO-specific data to adapt to its own body.
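To make the division of labor concrete, here is a minimal sketch of that two-stage plan-by-imagination loop. The class and method names are hypothetical stand-ins, not 1X's actual API; the only thing taken from the source is the structure: a generative world model dreams future frames, and an inverse dynamics model turns those frames into actions.

```python
import numpy as np


class WorldModel:
    """Hypothetical stand-in for the ~14B-parameter generative video model."""

    def imagine(self, current_frame: np.ndarray, instruction: str) -> list[np.ndarray]:
        # Generate a short video (a list of future frames) conditioned on
        # the current observation and the text instruction.
        raise NotImplementedError


class InverseDynamicsModel:
    """Hypothetical stand-in for the IDM that maps pixels to motor commands."""

    def extract_actions(self, frames: list[np.ndarray]) -> list[np.ndarray]:
        # Infer the joint-space action between each pair of consecutive frames.
        raise NotImplementedError


def plan(wm: WorldModel, idm: InverseDynamicsModel,
         observation: np.ndarray, instruction: str) -> list[np.ndarray]:
    """Dream a plausible future, then recover the actions that realize it."""
    dreamed_frames = wm.imagine(observation, instruction)  # the visual "what"
    actions = idm.extract_actions(dreamed_frames)          # the physical "how"
    return actions


# Illustrative call, assuming a camera frame is available:
# actions = plan(wm, idm, camera_frame, "pack this orange into the lunchbox")
```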



A clever trick in their training pipeline is “caption upsampling.” Since many video datasets have terse descriptions, 1X uses a VLM to generate richer, more detailed captions. This provides clearer conditioning and improves the model’s ability to follow complex instructions, a technique that has shown similar benefits in image models like OpenAI’s DALL-E 3.
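The idea behind caption upsampling can be sketched in a few lines. Everything here is illustrative: the prompt wording and the `vlm` callable are assumptions, standing in for whatever vision-language model 1X actually uses to enrich terse video descriptions before training.

```python
def upsample_caption(vlm, video_frames, terse_caption: str) -> str:
    """Expand a terse caption into a richer description for conditioning.

    `vlm` is assumed to be any callable that accepts video frames plus a
    text prompt and returns generated text.
    """
    prompt = (
        "Describe this clip in detail: the objects involved, how the hands "
        f"move, and the outcome. Original caption: '{terse_caption}'"
    )
    return vlm(video_frames, prompt)


# Illustrative effect: "person opens box" might become
# "A person lifts the flap of a small cardboard box with the left hand,
#  reaches in with the right hand, and removes a red apple."
```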

The Humanoid Advantage

This entire video-first approach hinges on a critical, and perhaps obvious, piece of hardware: the robot is shaped like a person. The 1XWM, trained on countless hours of humans interacting with the world, has developed a deep, implicit understanding of physical priors—gravity, momentum, friction, object affordances—that transfer directly because NEO’s body moves in a fundamentally human-like way.

As 1X puts it, the hardware is a “first-class citizen in the AI stack.” The kinematic and dynamic similarities between NEO and a human mean the model’s learned priors generally remain valid. What the model can visualize, NEO can, more often than not, actually do. This tight integration of hardware and software closes the often-treacherous gap between simulation and reality.

From Theory to Reality (With Some Stumbles)

The results are compelling. 1XWM allows NEO to generalize to tasks and objects it has zero direct training data for. The promotional video shows it steaming a shirt, watering a plant, and even operating a toilet seat—a task for which it had no prior examples. This suggests the knowledge for two-handed coordination and complex object interaction is being successfully transferred from the human video data.

But this isn’t magic. The system has its limitations. Generated rollouts can be “overly optimistic” about success, and monocular pretraining can leave 3D grounding weak, causing the real robot to undershoot or overshoot a target even when the generated video looks perfect. Success rates on dexterous tasks like pouring cereal or drawing a smiley face remain low.

However, 1X has found a promising way to boost performance: test-time compute. For a “pull tissue” task, the success rate jumped from 30% with a single video generation to 45% when the system was allowed to generate eight different possible futures and select the best one. While this selection is currently manual, it points to a future where a VLM evaluator could automate the process, significantly improving reliability.
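A best-of-N strategy like this is easy to express in code. The sketch below reuses the hypothetical `WorldModel`/`InverseDynamicsModel` interface from the earlier example; `score_fn` stands in for the selection step, which today is a human picking the rollout but could be a VLM judge returning a scalar score.

```python
def best_of_n_plan(wm, idm, score_fn, observation, instruction, n: int = 8):
    """Generate n candidate futures, pick the most promising one, act on it.

    `score_fn` maps a list of generated frames to a number; higher is better.
    With n=1 this degenerates to the single-rollout case described above.
    """
    candidates = [wm.imagine(observation, instruction) for _ in range(n)]
    best_frames = max(candidates, key=score_fn)
    return idm.extract_actions(best_frames)
```

Spending more compute at inference time to sample and rank futures is the same trade 1X reports for the “pull tissue” task: more rollouts, higher reliability, at the cost of latency.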

The Self-Teaching Flywheel

The 1XWM represents more than an incremental update; it’s a potential paradigm shift that could break the data bottleneck wide open. It creates a flywheel for self-improvement. By being able to attempt a broad set of tasks with a non-zero success rate, NEO can now generate its own data. Every action, whether a success or a failure, becomes a new training example that can be fed back into the model to refine its policy. The robot begins to teach itself.
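That flywheel is, at heart, a simple loop. The sketch below is an assumption-laden illustration of the idea, not 1X's training code: every attempted task is logged, successes and failures alike, and the growing dataset is periodically folded back into the model.

```python
def self_improvement_loop(robot, wm, idm, tasks, dataset, finetune):
    """Illustrative data flywheel: attempt tasks, log outcomes, refine.

    `robot`, `dataset`, and `finetune` are hypothetical interfaces; `plan`
    is the WM-then-IDM helper from the earlier sketch.
    """
    for instruction in tasks:
        observation = robot.observe()
        actions = plan(wm, idm, observation, instruction)
        trajectory, succeeded = robot.execute(actions)
        dataset.append((instruction, trajectory, succeeded))  # failures are signal too
        if len(dataset) % 1000 == 0:
            finetune(wm, idm, dataset)  # fold the robot's own experience back in
```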

Of course, major hurdles remain. The WM currently takes 11 seconds to generate a 5-second plan, with another second for the IDM to extract the actions. That latency is an eternity in a dynamic, real-world environment and a non-starter for reactive tasks or delicate, contact-rich manipulation.

Still, by tackling the data problem head-on, 1X may have just kicked open the door to a future where robots learn not from our tedious instruction, but from our collective, recorded experience. That future is accelerating, one internet video at a time.