Imagine teaching a robot to load a dishwasher. It seems like a task a machine learning engineer could solve in a weekend: pick up a plate, slide it into the rack, repeat. But anyone who has actually loaded a dishwasher knows the reality is far more complex. A plate must be tilted to fit in a tight space. A bowl needs to be angled, not upright. Glasses benefit from a gentle slide rather than a drop. Small items require different manipulation strategies than large ones.
MIT’s research into Robotic Manipulation revealed something fundamental: everyday manipulation tasks require a rich understanding of physics, contact, friction, and subtle adjustments. Humans solve these problems unconsciously, our hands gathering information through pressure, friction, and resistance. Robots currently lack this embodied knowledge, which is why robotic manipulation remains one of the hardest problems in robotics: not because algorithms are insufficient, but because training data doesn’t capture the necessary complexity.
This gap between simplistic manipulation tasks (pick and place) and real-world tasks (load, arrange, adjust) is where most robotic learning systems fail. The training data is the bottleneck, not the algorithms.
Why Current Robotics Data Falls Short
Most robotics datasets focus on pick-and-place tasks: a robot learns to grasp an object and move it to a target location. These tasks are relatively straightforward to demonstrate and label. But the real world demands more sophisticated manipulation.
Consider a robot tasked with organizing items in a cluttered drawer. Some items are soft (clothing) and require gentle handling. Others are fragile (ceramics) and require careful support. Some items need to be oriented in specific ways. The robot must understand which manipulation strategy applies to which object, how to adjust its grip based on feedback, and how to recover when something goes wrong.
Current datasets often capture only successful demonstrations of simple motions. They lack the rich detail required to train systems that can handle variability. A plate might be wet or dry, clean or dirty, old or new. Each variant changes the optimal manipulation strategy. If your training data only shows one type of plate in one condition, your model can’t generalize.
The depth of the problem becomes clearer when you consider contact-rich manipulation. Many everyday tasks (loading a dishwasher, folding clothes, assembling objects) involve sustained contact with the object being manipulated. The robot must sense this contact, adjust its motion based on feedback, and sometimes deliberately exploit friction or use contact as a navigational cue. Traditional manipulation data doesn’t capture this dimension.
Three Data Paradigms for Robotic Learning
As the robotics field has matured, three distinct approaches to data collection have emerged, each with different tradeoffs.
Imitation Learning from Demonstrations
The oldest and most direct approach: humans perform the task, and we record their behavior. A person loads a dishwasher while sensors capture hand position, orientation, and pressure. The robot learns to reproduce these motions.
This approach works well for tasks where the human motion directly translates to robot motion. But most manipulation tasks don’t translate directly. Humans have different arm lengths, different dexterity, different sensing modalities. A motion that works for human hands might fail for a robotic gripper. And humans often rely on unconscious adjustments based on proprioception and tactile feedback that robots don’t have access to.
Imitation learning also tends to focus on successful demonstrations. Naively including failures in the dataset can actually degrade learning, because a model trained to imitate will reproduce whatever it sees, failures included. Yet understanding failure modes is often crucial for robust manipulation.
Video-Language-Action Pretraining
A more modern approach leverages internet-scale video data. The insight is that humans perform endless manipulation tasks on video (YouTube, TikTok, etc.). These videos contain rich information about how objects are handled, what strategies succeed, and what the physical world expects.
By pretraining models on video-language-action triplets (video of task, language description, action taken), systems can learn general manipulation principles that transfer to robotic tasks. This approach scales to billions of examples and captures enormous diversity in human manipulation strategies.
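The triplet structure above can be sketched as a minimal data record. The field names here are illustrative assumptions, not the schema of any particular dataset:

```python
from dataclasses import dataclass

# Hypothetical schema for one video-language-action training triplet.
# Field names and the pose format are assumptions for this sketch.
@dataclass
class VLATriplet:
    video_frames: list   # paths to RGB frames sampled from the clip
    instruction: str     # natural-language description of the task
    actions: list        # per-frame end-effector targets (x, y, z)

clip = VLATriplet(
    video_frames=["frame_000.jpg", "frame_001.jpg"],
    instruction="place the plate in the lower rack",
    actions=[(0.42, 0.10, 0.31), (0.43, 0.09, 0.28)],
)

# Frames and actions must stay aligned for pretraining to make sense.
assert len(clip.video_frames) == len(clip.actions)
```

Even at internet scale, each example reduces to this shape: pixels, words, and motion, with nothing about force or contact.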
But video-based learning has limitations. Video captures the external appearance of motion but not the forces, pressures, and contact states that are crucial for manipulation. A video of loading a dishwasher shows the motion but not the forces the human is exerting. The robot, lacking this information, struggles to replicate the task with a different gripper or in a different kitchen layout.
Contact-Rich Manipulation Data
The most sophisticated approach involves collecting data from actual robots performing tasks in real environments, with rich sensors capturing contact, force, pressure, and other physical signals. This data is expensive to collect (it requires expensive hardware and significant setup time) but captures the actual physics of manipulation.
Contact-rich data is crucial for tasks where the robot must respond to real-time feedback. Loading a dishwasher requires the robot to sense contact with the rack, adjust its motion based on that feedback, and sometimes force through friction. This requires training data that includes force/pressure signals, not just position trajectories.
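A contact-rich sample makes this concrete: position trajectories alone can miss the moment the gripper meets the rack, while a force channel makes it explicit. The field names and the 2.0 N threshold below are assumptions for the sketch:

```python
# Illustrative contact-rich episode: end-effector position plus a
# measured force channel. The threshold is a made-up example value.
CONTACT_THRESHOLD_N = 2.0

episode = [
    {"t": 0.00, "ee_pos": (0.40, 0.10, 0.35), "force_n": 0.1},
    {"t": 0.05, "ee_pos": (0.40, 0.10, 0.31), "force_n": 0.3},
    {"t": 0.10, "ee_pos": (0.40, 0.10, 0.28), "force_n": 3.2},  # rack contact
]

def first_contact(samples, threshold=CONTACT_THRESHOLD_N):
    """Return the timestamp when measured force first exceeds the threshold."""
    for s in samples:
        if s["force_n"] >= threshold:
            return s["t"]
    return None

print(first_contact(episode))  # -> 0.1
```

A model trained only on the `ee_pos` column would have to guess where contact happens; the force column turns it into a labeled, learnable event.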
The Hardware Tiers: Choosing Your Data Collection Substrate
The choice of sensors and hardware used to collect training data has profound implications for what the resulting model can learn and do.
Video-Only Data Collection
The most accessible approach uses only camera sensors. A human or robot performs the task, and we record RGB or RGB-D video. This is cheap and scalable but limited in what it captures. Contact forces, pressure distribution, and other tactile information are invisible to cameras.
Video-only data is appropriate for tasks where visual understanding is sufficient. Picking objects from a shelf, basic sorting, or reaching and grasping in uncluttered environments can often be learned from video. But any task that involves sustained contact, force adjustment, or tactile feedback becomes very difficult.
Video Plus Depth Sensors
Adding depth sensing (RGB-D) provides richer information. The model can learn about object size, orientation, and spatial relationships with greater precision. Depth data helps disambiguate occlusions and supports a better understanding of stacking and packing tasks.
RGB-D data is more informative than RGB alone but still doesn’t capture contact forces. It remains a good choice for manipulation tasks where understanding 3D geometry is more important than understanding forces.
Instrumented Gloves and Full Sensor Suites
The most comprehensive approach instruments the human demonstrator with pressure sensors, force sensors, and inertial sensors. Specialized gloves can capture finger pressure and position. Full robotic arms can log motor currents and joint torques. This data is gold for training: it captures the full richness of how a skilled human manipulates objects.
The tradeoff is obvious: instrumented data collection is expensive and labor-intensive. You can’t scale it to millions of demonstrations. But because it captures the actual physical signals the robot needs to control, the learned models tend to be more robust and generalizable.
The Tiered Labeling Strategy: Weak, Better, Gold
Most robotics projects can’t afford to collect full, instrumented, expert-quality data for every task. Instead, successful projects use a tiered labeling strategy that allocates resources efficiently.
The base tier consists of weak labels at scale. These might be automatically generated (computer vision systems detecting objects and their positions) or collected from less skilled annotators. This tier might include 100,000+ examples. The labels are imperfect, but the scale allows models to learn basic patterns.
The second tier consists of better-quality labels from trained annotators working with clear instructions. These annotators understand the task domain and follow detailed guidelines. Inter-annotator agreement is tracked, and inconsistent labels are reviewed. This tier might include 20,000 examples. The quality is substantially higher, and the model learns more nuanced patterns.
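Tracking inter-annotator agreement in this tier can be as simple as computing Cohen's kappa over items labeled by two annotators. A minimal, self-contained sketch (the action labels are invented examples):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["grasp", "grasp", "slide", "grasp", "slide", "tilt"]
b = ["grasp", "slide", "slide", "grasp", "slide", "tilt"]
print(round(cohens_kappa(a, b), 3))  # -> 0.739
```

Batches whose kappa falls below an agreed floor get routed back for review, which is how inconsistent labels surface before they reach training.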
The top tier consists of gold-standard labels from expert annotators (ideally domain experts like roboticists, engineers, or the human demonstrators themselves). These labels are reviewed and verified. Disagreements are resolved through discussion. This tier might include 2,000 examples but captures the most crucial information.
The secret is using this pyramid efficiently. If you have limited resources, concentrate them where they matter most. Use weak labels to understand the overall problem space. Use better labels for common cases. Use gold labels for edge cases, safety-critical scenarios, and the most complex variations.
This strategy is particularly powerful for manipulation tasks where the distribution of difficulty is heavily skewed. Many demonstrations involve straightforward motions (the weak tier handles these adequately), but a few involve complex manipulation requiring nuance (these are where gold labels matter).
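One way to use the pyramid efficiently is to upweight scarcer, higher-quality tiers when sampling training examples. The tier sizes echo the figures above; the weights are illustrative assumptions:

```python
import random

# Tier sizes from the pyramid, with per-example sampling weights that
# upweight scarcer, higher-quality labels. Weights are assumptions.
tiers = {
    "weak":   {"size": 100_000, "weight": 1.0},
    "better": {"size": 20_000,  "weight": 3.0},
    "gold":   {"size": 2_000,   "weight": 10.0},
}

def sample_tier(rng):
    """Pick a tier for the next example, proportional to size * weight."""
    names = list(tiers)
    masses = [tiers[t]["size"] * tiers[t]["weight"] for t in names]
    return rng.choices(names, weights=masses, k=1)[0]

rng = random.Random(0)
batch = [sample_tier(rng) for _ in range(1000)]
```

With these numbers, gold examples make up under 2% of the data but roughly 11% of what the model sees, without the weak tier ever being discarded.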
Addressing the Variability Problem
Real-world robotic tasks involve enormous variability. Dishwashers are different (top-loading vs. front-loading). Dishes vary (ceramics vs. glass, plates vs. bowls). Hands vary (human demonstrators differ in strength, coordination, hand size). Grippers vary (suction-based vs. parallel-jaw vs. multi-finger).
Traditional datasets often ignore this variability in pursuit of clean, controlled environments. But training on limited variability produces models that overfit to those specific conditions. A model trained on a single type of dishwasher with a single type of gripper won’t generalize to a different setup.
Better approaches systematically vary the conditions during data collection. Collect demonstrations from multiple human demonstrators. Collect the same task with different grippers. Include some demonstrations where conditions are non-ideal (dishes that are wet, grippers that are slightly worn). This deliberate variability is expensive but pays off in model robustness.
The Path From Data to Deployment
Collecting training data is just the beginning. The path from good training data to reliable deployed systems requires careful iteration. You need robust evaluation metrics that measure whether the robot can actually accomplish the task in novel environments, not just whether it can reproduce training demonstrations. You need feedback loops from deployment failures back to data collection, so you can identify failure modes and collect data to address them.
Organizations that excel at robotic manipulation do this systematically. They collect initial training data, deploy a system, observe failures, analyze those failures to identify which data would help, and then collect targeted data to address those specific challenges. This iterative process is where data becomes genuinely valuable.
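The "observe failures, collect targeted data" loop can start with nothing fancier than a tally of failure modes from deployment logs. The log entries here are hypothetical:

```python
from collections import Counter

# Hypothetical deployment failure log: each entry tags the observed
# failure mode for one failed episode.
failure_log = [
    "slipped_wet_plate", "rack_collision", "slipped_wet_plate",
    "dropped_glass", "slipped_wet_plate", "rack_collision",
]

def collection_priorities(log, top_k=2):
    """Rank failure modes by frequency to target the next collection round."""
    return [mode for mode, _ in Counter(log).most_common(top_k)]

print(collection_priorities(failure_log))
# -> ['slipped_wet_plate', 'rack_collision']
```

The output becomes the brief for the next round of demonstrations: wet plates and rack clearances, not more of what already works.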
Better Data for Better Robots
The robotics industry has historically underinvested in data collection and preparation relative to algorithm development. But as algorithms improve, the bottleneck has shifted. The difference between a robot that works in the lab and one that works in the real world is increasingly about training data quality, diversity, and richness.
Dishes, clothes, and countless other objects will continue to require subtle manipulation that simple pick-and-place algorithms can’t capture. Robots that solve these problems will be those trained on data that captures the actual physical complexity of the task.
Ready to train robots that actually manipulate the real world?
Schedule a robotics data consultation with BergLabs to design a data collection and labeling strategy for your specific manipulation tasks.
