HABIT: Human-Aware Behavior and Interaction Training Dataset for Robot Manipulation

Overview

A human in every episode

Existing robot manipulation datasets are collected in human-absent settings, so policies trained on them can perform a task in isolation but fail to act appropriately when a person is present.
HABIT closes this gap, with the goal of robot policies that can work alongside people.

Workspace setup: two robot arms, a human, and five RGB cameras marked in red

The collection setup

Bimanual manipulation captured with two Franka Research 3 arms and five synchronized RGB cameras, with a human sharing the workspace in every episode.

Robot side (3): two wrist cameras and one egocentric view facing the shared workspace.
Human side (2): a head-mounted egocentric view and an exocentric view of the whole interaction.

Episodes

Hours of manipulation

Tasks

Unique human-robot subtask pairs

Task design

Three interaction roles

Human-robot interaction takes many forms. To capture this diversity, we adopt three roles from the Human-Robot Interaction (HRI) literature, each defined by how the human and robot relate within a subtask.

Collaborator

20 tasks · 3,198 episodes

Act together

The human and robot jointly accomplish a shared goal through direct physical interaction (e.g., handing over an object or jointly holding a bucket), and the robot must coordinate with the human both spatially and temporally.

Coworker

20 tasks · 3,969 episodes

Work side by side

The human and robot perform separate tasks while sharing only the workspace; the robot must avoid collisions with the human to ensure safety.

Supervisor

20 tasks · 3,396 episodes

Follow instructions

The human directs the robot through explicit cues such as gestures or demonstrated actions, and the robot must infer the human's intent from visual input alone.

Sample HABIT tasks grouped by the three roles — A preview of HABIT, spanning the three human-robot interaction roles at its core: Collaborator, Coworker, and Supervisor.

Collaborator tasks — A preview of HABIT, spanning the three human-robot interaction roles at its core: Collaborator, Coworker, and Supervisor.

A single instruction such as "clean the shelf together" is too ambiguous to elicit consistent demonstrations. So each task follows a task workflow: a directed graph of human (H_i) and robot (R_i) subtasks that fixes how the two agents divide and sequence the work.

The task workflow, in motion

A representative workflow per role, with the robot's and human's cameras in sync.

Plays through automatically — the highlighted subtask advances with the video. Click any node to jump to that step.

Robot view

Human view

Table Service · Collaborator. The human and robot take turns; each subtask waits for the partner to finish the previous one.

Robot view

Human view

Shelf Cleaning · Coworker. Two independent chains that share only the workspace.

Robot view

Human view

Cup on Coaster · Supervisor. The robot reads the human's placement cue from vision alone.

How we collected the data

Reactive interaction

The rule behind every HABIT recording: each robot action responds to a cue the cameras can actually see.

Because the robot operator knows the subtask sequence in advance, they could pre-execute the next subtask from memory rather than reacting to the human. For example, the robot operator might begin reaching toward a target object before the human ever points to it.

This quietly breaks learning: the cue that triggers the action then falls outside the camera input, so no policy could ever recover the behavior from the recorded data.

"Each operator acts only after directly observing the partner's behavior. We prohibit any coordination signal that is not captured in the recorded observations, so every demonstrated action is grounded in cues that are also available to the policy."

Reactive interaction. The robot operator acts only after an observable cue, so every trigger is captured in the data.

Beyond reactivity

Eliciting human-aware behavior

Reactivity alone does not guarantee the human-aware behavior we want, so the collection protocol adds three deliberate design choices.

Yielding under safety-first priority

The robot yields to keep the human safe

Human safety is the overriding constraint. Whenever the robot is about to collide with the human or a human-held object, the robot operator decisively retracts the arms rather than continuing the trajectory, so yielding becomes the robot's default recorded response.

Temporal adaptation

Matching the human's tempo

The human operator's movement speed is deliberately varied across episodes, so the data captures the robot adapting its pace to the human rather than moving at a single fixed cadence.

Gesture grounding

No shortcut to the gesture

The risk is not that the robot follows words, but that it takes a shortcut and acts before the human has even pointed. So during collection, the human operator waits a random interval before gesturing, forcing the recorded behavior to depend on actually reading and grounding the gesture.

Experiments

Human-aware behavior emerges from data

We fine-tune two open-source VLAs, π_0.5 and GR00T N1.6, on a six-task subset. Training steps and batch size are held fixed, so the dataset is the only variable. Each number is the mean over 20 trials.

What is Robot-only? The conventional baseline: the same tasks, environment, and robot operator as HABIT, but recorded with no co-present human. The robot does only its own share of the work in an empty workspace, and for Supervisor tasks it receives the same indexed instruction without the human's pointing gesture.

Success rate across the six evaluation tasks

Higher is better. Collaborator tasks need a human partner, so Robot-only is not applicable.

Robot-only HABIT (ours)

HABIT improves success on every comparable task, with the largest gains on Coworker tasks, where workspace overlap demands reactive yielding (Box Packaging: 10% to 85% for π_0.5). The same pattern holds for GR00T N1.6, so the gain comes from the data: human-present demonstrations teach what conventional Robot-only data cannot.

Robot-only vs. HABIT, side by side

Real policy rollouts on the four tasks that have a Robot-only baseline. Watch how the HABIT policy yields and follows the human where Robot-only does not.

Waste Sorting Coworker

Box Packaging Coworker

Donut Serving Supervisor

Food Storage Supervisor

Role-specific failures, sharply reduced

Each role has a characteristic failure mode. HABIT instills the behaviors that human-absent data cannot teach.

Role-specific failure modes: precondition following, collision, and gesture following — What each failure looks like. The robot should follow the **desired** path (green), but a **violation** (red) means acting before a precondition is met (Collaborator), colliding in shared space (Coworker), or ignoring the human's gesture (Supervisor). The charts below measure how often each occurs.

Precondition following failure (Collaborator) — What each failure looks like. The robot should follow the **desired** path (green), but a **violation** (red) means acting before a precondition is met (Collaborator), colliding in shared space (Coworker), or ignoring the human's gesture (Supervisor). The charts below measure how often each occurs.

Two evaluation tasks per role, both models shown. Lower is better; Robot-only is not applicable for Collaborator tasks (which need a human partner).

Robot-only HABIT (ours)

HABIT sharply reduces every role-specific failure across both models. The improvement is largest for π_0.5 on Waste Sorting, where collisions fall from 60% to 0%.

A strong prior for new tasks

Mid-training π_0.5 on HABIT, then fine-tuning on a new task, beats fine-tuning from scratch (labeled direct fine-tuning in the plot below) at every data budget. HABIT transfers as a reusable prior for human-robot interaction.

Direct fine-tuning HABIT mid-training + fine-tuning

On Waste Sorting, mid-training with just 50 demonstrations matches direct fine-tuning at 200. On Shelf Cleaning, direct fine-tuning never exceeds 45% even at 200 demos, while mid-training reaches 60% with only 100 demonstrations.

HABIT: Human-Aware Behavior and Interaction
Training Dataset for Robot Manipulation

What HABIT enables

A human in every episode

The collection setup

Three interaction roles

Act together

Work side by side

Follow instructions

The task workflow, in motion

Reactive interaction

Eliciting human-aware behavior

The robot yields to keep the human safe

Matching the human's tempo

No shortcut to the gesture

Human-aware behavior emerges from data

Success rate across the six evaluation tasks

Robot-only vs. HABIT, side by side

Role-specific failures, sharply reduced

A strong prior for new tasks

BibTeX