For decades, the promise of a household robot has been just that—a promise. We were supposed to have Rosie the Robot by now, but instead, we have disc-shaped vacuums that get stuck on bathmats. The gap between science fiction and our domestic reality is vast, littered with the corpses of failed startups and overhyped demos. But a new competition, the BEHAVIOR Challenge, set to debut at NeurIPS 2025, is poised to drag the field, kicking and screaming, into the real world. Or at least, a very, very convincing simulation of it.
The challenge is simple in its goal and brutal in its execution: make a robot do actual chores. Not just picking up a block, but completing complex, multi-step activities that humans find mundane. BEHAVIOR, which stands for Benchmark for Everyday Household Activities in Virtual, Interactive, and Ecological Environments, isn’t just another robotics benchmark; it’s a full-blown domestic gauntlet designed to break today’s state-of-the-art AI. And frankly, it’s about time someone built one.
Welcome to the Uncanny Valley Household
At the heart of the BEHAVIOR Challenge is a deeply sophisticated simulation environment that makes most robotics sandboxes look like a child’s playpen. This is no sterile lab; it’s a high-fidelity, physics-based world where things get messy. The benchmark is built on three pillars:
- 1,000 Everyday Tasks: Forget stacking cubes. We’re talking about tasks like “Assembling Gift Baskets,” “Cleaning Up Plates and Food,” and the existentially dreadful “Putting Away Halloween Decorations.” Each task is formally defined in the BEHAVIOR Domain Definition Language (BDDL), which specifies the initial state and the precise conditions for success.
- 50 Interactive Environments: These are not just static rooms but fully interactive, house-scale layouts populated with around 10,000 manipulable objects. A fridge can be opened, a tomato can be sliced, and a cloth can be, well, deformed.
- The OmniGibson Simulator: Built on NVIDIA’s Omniverse platform, this is where the magic (and physics) happens. OmniGibson supports not just rigid-body physics but also deformable objects, fluid interactions, and object state changes such as heating, cooling, and cutting. That breadth is what separates it from its predecessors, and it provides the level of realism needed to train robots that might one day encounter a real kitchen.
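To make the idea of a formally defined task concrete, here is a minimal Python sketch of a task as an initial state plus a set of goal conditions. This is not BDDL syntax or the benchmark's API; the `TaskDefinition` class, the predicate tuples, and the object names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# A predicate is a tuple such as ("inside", "plate_1", "sink_1").
# Predicate names and object names here are illustrative assumptions.
Predicate = tuple[str, ...]


@dataclass(frozen=True)
class TaskDefinition:
    """Hypothetical stand-in for a task definition: start state plus goals."""

    name: str
    initial_state: frozenset[Predicate]
    goal_conditions: frozenset[Predicate]

    def is_satisfied(self, world_state: set[Predicate]) -> bool:
        # The task succeeds only if every goal condition holds in the world.
        return self.goal_conditions <= world_state


task = TaskDefinition(
    name="cleaning_up_plates_and_food",
    initial_state=frozenset({
        ("ontop", "plate_1", "table_1"),
        ("ontop", "apple_1", "plate_1"),
    }),
    goal_conditions=frozenset({
        ("inside", "plate_1", "sink_1"),
        ("inside", "apple_1", "fridge_1"),
    }),
)

# At the start, nothing has been put away, so the goal is not met.
start_satisfied = task.is_satisfied(set(task.initial_state))

# A world state in which both objects ended up where the goal requires.
end_state = {
    ("inside", "plate_1", "sink_1"),
    ("inside", "apple_1", "fridge_1"),
}
end_satisfied = task.is_satisfied(end_state)
```

The point of the sketch is the separation of concerns: the definition says nothing about how the robot gets there, only what the world looked like at the start and what must be true at the end.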
This isn’t just a test of manipulation or navigation in isolation. BEHAVIOR is the first benchmark of its kind that demands a robot perform high-level reasoning, long-range navigation, and dexterous bimanual manipulation all at once. To succeed, an AI can’t just be good at one thing; it has to be good at thinking like a (very patient) human.
The NeurIPS 2025 Gauntlet
For its inaugural run at NeurIPS 2025, the challenge is unleashing 50 of these full-length tasks upon the global research community. Contestants will have to program a virtual robot to tackle scenarios that can take several minutes to complete, spanning multiple rooms and involving dozens of sub-goals. Think “Make Pizza” or “Wash Dog Toys”—tasks that require planning, memory, and a whole lot of digital elbow grease.
The default robot for this trial-by-simulation is Galaxea’s R1 Pro, a wheeled humanoid with two 7-DOF arms, a 4-DOF torso, and a suite of sensors. This isn’t some clumsy tin can; its design is explicitly chosen for the kind of reach, stability, and bimanual coordination essential for household activities.
To spare participants from bootstrapping their AI from a state of primordial ignorance, the organizers are providing a massive dataset: 10,000 expert demonstrations, totaling over 1,200 hours of meticulously recorded data. This isn’t shaky, amateur footage. It’s clean, near-optimal data collected by vendor Simovation using the JoyLo teleoperation system. JoyLo, a clever setup that mounts handheld controllers on kinematic-twin arms, lets human operators guide the robot smoothly through tasks, producing a strong template for imitation learning.
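A quick back-of-the-envelope calculation shows what those dataset figures imply. The sketch below uses only the numbers quoted in this article; the even split of demonstrations across the 50 challenge tasks is an assumption made for illustration, and since the total is "over" 1,200 hours, the results are lower bounds.

```python
# Figures quoted in the article.
NUM_DEMOS = 10_000
TOTAL_HOURS = 1_200  # "over 1,200 hours", so treat results as lower bounds
NUM_TASKS = 50

# Average length of one demonstration, in minutes.
avg_minutes_per_demo = TOTAL_HOURS * 60 / NUM_DEMOS

# ASSUMPTION: demonstrations are split evenly across the 50 tasks.
demos_per_task = NUM_DEMOS / NUM_TASKS
hours_per_task = TOTAL_HOURS / NUM_TASKS

print(f"{avg_minutes_per_demo:.1f} min per demo")
print(f"{demos_per_task:.0f} demos and {hours_per_task:.0f} hours per task")
```

An average of roughly seven minutes per demonstration lines up with the description of tasks that take several minutes to complete.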
Why This is So Damn Hard
The term “long-horizon” gets thrown around a lot in AI, but BEHAVIOR gives it teeth. A task like “Boxing Books Up for Storage” might require the robot to navigate to the living room, identify the correct books, find a box in the garage, bring it back, and then sequentially place each book inside. This tests planning and memory over extended periods in a way few benchmarks ever have.
Furthermore, the sheer diversity of object interactions is staggering. Robots must understand and execute skills far beyond grasping. They’ll need to pour liquids, wipe surfaces, cut vegetables, and toggle switches. Objects can be opened, closed, heated, frozen, cleaned, or even set on fire. This rich set of required skills—at least 30 distinct primitives—forces researchers to move beyond single-task models and toward more generalized, adaptable intelligence.
To make the challenge accessible, organizers are providing several baseline models, including standard imitation-learning methods like ACT and Diffusion Policy, as well as pre-trained models like OpenVLA. The entire framework is open-source, complete with starter kits and tutorials to lower the barrier to entry.
How Do You Judge a Robotic Butler?
Success in the BEHAVIOR Challenge is primarily measured by the task success rate. The system uses each task’s BDDL definition to check whether the robot has satisfied all of the goal conditions. Partial credit is awarded when only some of those conditions are met, encouraging solutions that make meaningful progress even if they don’t achieve perfection.
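One plausible way to turn goal conditions into a score with partial credit is to report the fraction of conditions satisfied at the end of an episode. The sketch below assumes exactly that; it is an illustration, not the organizers' scoring code, and the function names and predicate tuples are hypothetical.

```python
def partial_credit(goal_conditions, world_state):
    """Return the fraction of goal conditions present in world_state (0.0 to 1.0).

    ASSUMPTION: partial credit is the share of satisfied goal conditions.
    """
    goals = set(goal_conditions)
    if not goals:
        raise ValueError("a task needs at least one goal condition")
    return len(goals & set(world_state)) / len(goals)


def is_success(goal_conditions, world_state):
    """Full success means every goal condition is satisfied."""
    return partial_credit(goal_conditions, world_state) == 1.0


# Hypothetical goal: four books must end up inside a box.
goals = [("inside", f"book_{i}", "box_1") for i in range(1, 5)]

# The robot managed to pack three of the four books.
final_state = {
    ("inside", "book_1", "box_1"),
    ("inside", "book_2", "box_1"),
    ("inside", "book_3", "box_1"),
    ("ontop", "book_4", "shelf_1"),
}

score = partial_credit(goals, final_state)
succeeded = is_success(goals, final_state)
```

Under this scheme a run that packs three of four books earns most of the credit without counting as a full success, which is the behavior the partial-credit rule is meant to reward.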
Secondary metrics will also be tracked to separate the clever from the clumsy:
- Efficiency: Time taken, distance traveled, and total joint movement will be measured. An elegant solution is a fast one.
- Data Utilization: The organizers will note how much of the 1,200 hours of demonstration data was used to train each submission, providing insights into data efficiency.
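The efficiency metrics can be sketched from a logged trajectory. The article does not say how the organizers compute them, so the definitions below are assumptions: elapsed time is the number of steps times a fixed timestep, distance is the summed straight-line movement of the robot base, and joint movement is the summed absolute change of every joint angle. The function name and data layout are hypothetical.

```python
import math


def efficiency_metrics(base_xy, joint_positions, dt):
    """Compute simple efficiency metrics from a logged trajectory.

    base_xy:          list of (x, y) base positions, one per timestep
    joint_positions:  list of joint-angle tuples, one per timestep
    dt:               seconds between consecutive timesteps
    """
    if len(base_xy) != len(joint_positions):
        raise ValueError("trajectories must have the same number of steps")

    # Elapsed time: one interval of length dt between consecutive samples.
    elapsed = max(len(base_xy) - 1, 0) * dt

    # Distance traveled: sum of straight-line segments between base positions.
    distance = sum(math.dist(a, b) for a, b in zip(base_xy, base_xy[1:]))

    # Joint movement: sum of absolute per-joint changes across all steps.
    joint_movement = sum(
        abs(q_next - q_prev)
        for prev, cur in zip(joint_positions, joint_positions[1:])
        for q_prev, q_next in zip(prev, cur)
    )
    return {
        "time_s": elapsed,
        "distance_m": distance,
        "joint_movement_rad": joint_movement,
    }


# A tiny three-step trajectory with a two-joint robot, for illustration only.
metrics = efficiency_metrics(
    base_xy=[(0.0, 0.0), (3.0, 4.0), (3.0, 4.0)],
    joint_positions=[(0.0, 0.0), (0.5, -0.5), (0.5, 0.0)],
    dt=0.5,
)
```

Lower values on all three numbers indicate a more direct solution, which is why two submissions with the same success rate can still be ranked against each other.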
The competition officially launched on September 2nd, 2025, with final submissions due by November 16th. The winners, who will be announced at the NeurIPS conference in San Diego in December, will receive cash prizes—a modest $1,000 for first place—but the real prize is the bragging rights and the chance to meaningfully advance the field of embodied AI.
Ultimately, the BEHAVIOR Challenge is more than just a competition; it’s a reality check for the entire robotics industry. It’s a meticulously designed crucible to test whether our algorithms are ready to move out of the lab and into the chaotic, unpredictable, and often sticky environment of a human home. The results from NeurIPS 2025 won’t just show us who has the best model; they’ll show us how far we have to go before our robot helpers are ready to do the dishes.






