We sit on the world's largest untapped source of manipulation data — 63 million Indian MSMEs where skilled workers still do by hand what robots need to learn. Our ML pipeline turns footage of that work into RLDS-ready training data.
While others capture data in labs or strap cameras on a few hundred workers, we have access to the most labor-intensive manufacturing economy on Earth.
India has 63.4 million registered MSMEs — of which 11.7 million are in manufacturing. These aren't automated factories. These are workshops where skilled workers hand-assemble electronics, weld metal, weave textiles, mold plastics, pack goods, sort components, and perform thousands of manipulation tasks every day — by hand.
This is exactly the data that robots need. Not staged demonstrations in clean labs. Not simulated physics. Real humans performing real tasks with real variability — different lighting, different objects, different hand sizes, real mistakes and real recoveries.
Other data providers deploy cameras in a handful of partner facilities or rely on egocentric wearables that capture only the wearer's viewpoint with monocular depth estimation. We deploy multi-camera hardware depth rigs across a growing network that taps into India's MSME ecosystem — a sector employing over 300 million people across every conceivable manufacturing task.
The tasks that are still manual here are precisely the tasks robots struggle with most. The scale isn't hundreds of facilities. It's a pathway to millions. That's the ManuData moat.
The robotics industry is spending billions on data that doesn't scale, doesn't transfer, or doesn't exist.
Teleoperation requires specialized hardware, trained operators, and lab environments. It produces clean data, but at $100–500/hour, and most datasets run to hundreds of hours, not the millions needed.
Head-mounted cameras capture first-person video at scale, but lack hardware depth, 6DoF object tracking, segmentation masks, and 3D scene reconstruction. Monocular depth estimation can't match the accuracy of hardware stereo.
Synthetic data is infinitely scalable but suffers from unrealistic physics, limited task diversity, and artifacts that don't transfer to the messy real world. The sim-to-real gap remains the fundamental bottleneck.
Multi-camera hardware depth rigs + a 7-model ML pipeline. We don't just deliver video — we deliver structured training data ready for your robot.
3x Intel RealSense D455 (active IR stereo) + 1x ZED 2i (neural depth) per station: one overhead, two at 45°, and one head-on hero view. Hardware-synced to under 1 ms. Real depth, not monocular estimation.
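For a sense of how the RealSense side of such a rig is wired together, here is a minimal sketch using the pyrealsense2 SDK. The serial numbers, stream resolutions, and the master/slave assignment are illustrative assumptions, and the ZED 2i would be configured separately through its own SDK.

```python
import pyrealsense2 as rs

# Illustrative serial numbers: one master and two slaves per station (assumption).
CAMERAS = {
    "234522300001": 1,   # inter_cam_sync_mode 1 = master (drives the sync signal)
    "234522300002": 2,   # 2 = slave (triggered by the master)
    "234522300003": 2,
}

pipelines = []
for serial, sync_mode in CAMERAS.items():
    cfg = rs.config()
    cfg.enable_device(serial)
    # Example streams; actual resolutions and frame rates are deployment-specific.
    cfg.enable_stream(rs.stream.depth, 1280, 720, rs.format.z16, 30)
    cfg.enable_stream(rs.stream.color, 1280, 800, rs.format.bgr8, 30)

    pipe = rs.pipeline()
    profile = pipe.start(cfg)

    # Put every depth sensor on the shared hardware trigger so frames line up.
    depth_sensor = profile.get_device().first_depth_sensor()
    depth_sensor.set_option(rs.option.inter_cam_sync_mode, sync_mode)
    pipelines.append(pipe)

# Frames are then pulled with pipe.wait_for_frames() and matched across
# cameras by hardware timestamp (frame.get_timestamp()).
```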
CLIFF extracts a 33-joint SMPL body mesh in world coordinates. HaMeR recovers 21 joints per hand at under 8 mm error with a ViT-H backbone. Full kinematic chain from spine to fingertips.
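CLIFF and HaMeR ship as research codebases rather than a single API, so the sketch below only illustrates how their per-frame outputs could be stitched into the full kinematic chain described above. run_cliff and run_hamer are hypothetical stand-ins, and the wrist indices are illustrative.

```python
import numpy as np

# Hypothetical stand-ins for the released CLIFF / HaMeR checkpoints; each
# returns estimates for one RGB frame in the shapes used by the episode schema.
def run_cliff(frame_rgb):
    return np.zeros((33, 3))                      # world-frame body joints (SMPL)

def run_hamer(frame_rgb):
    return np.zeros((21, 3)), np.zeros((21, 3))   # left/right hands, wrist-relative

def build_pose_record(frame_rgb):
    body = run_cliff(frame_rgb)
    left, right = run_hamer(frame_rgb)

    # Anchor each hand at the matching body wrist so fingertips share the
    # body's world frame (wrist joint indices are illustrative).
    LEFT_WRIST, RIGHT_WRIST = 15, 16
    left = left + body[LEFT_WRIST]
    right = right + body[RIGHT_WRIST]

    return {"body_pose": body, "left_hand": left, "right_hand": right}
```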
FoundationPose tracks the position and orientation of every manipulated object in 3D space, with zero-shot generalization to novel objects from just a CAD file or reference photos. Under 5° rotation error.
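The sub-5° figure refers to geodesic rotation error between an estimated and a reference orientation. A minimal sketch of that metric using SciPy, assuming quaternions are reordered from the [qw, qx, qy, qz] layout in the episode schema below to SciPy's [x, y, z, w] convention:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def rotation_error_deg(q_est, q_ref):
    """Geodesic angle between two orientations, in degrees (quaternions as [x, y, z, w])."""
    return np.degrees((R.from_quat(q_est) * R.from_quat(q_ref).inv()).magnitude())

# Example: a tracked pose that is 3 degrees off about the z-axis.
q_ref = R.from_euler("z", 0, degrees=True).as_quat()
q_est = R.from_euler("z", 3, degrees=True).as_quat()
print(rotation_error_deg(q_est, q_ref))   # ~3.0
```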
SAM 2 provides per-pixel object masks tracked through occlusions. ActionFormer detects temporal boundaries — pick, place, screw, inspect — with 72% mAP. No manual annotation needed.
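A hedged sketch of how detected action segments could be mapped onto the per-step action field in the episode schema below. The segment tuples, frame rate, and the IDLE fallback label are illustrative, not raw ActionFormer output:

```python
def segments_to_step_actions(segments, num_steps, fps=30):
    """Map (label, start_s, end_s) action segments onto per-step labels.

    Steps not covered by any detected segment fall back to a hypothetical
    "IDLE" label; segment times come from the temporal detection stage.
    """
    actions = ["IDLE"] * num_steps
    for label, start_s, end_s in segments:
        for i in range(int(start_s * fps), min(int(end_s * fps), num_steps)):
            actions[i] = label
    return actions

segments = [("REACH", 0.0, 1.5), ("PICK", 1.5, 2.5), ("PLACE", 2.5, 4.5)]
actions = segments_to_step_actions(segments, num_steps=900)
print(actions[0], actions[50], actions[100])   # REACH PICK PLACE
```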
3D Gaussian Splatting builds photorealistic scene models for novel viewpoint synthesis at 100+ FPS. Generate robot-eye-view training data from any angle. 1000x faster than NeRF.
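Novel robot-eye views are specified as camera poses handed to the splat renderer. A minimal NumPy sketch of the look-at math for placing a hypothetical camera above a workstation, with the renderer call itself omitted:

```python
import numpy as np

def look_at_extrinsics(camera_pos, target, world_up=(0.0, 0.0, 1.0)):
    """World-to-camera extrinsics [R | t], OpenCV axes (x right, y down, z forward)."""
    camera_pos, target, world_up = map(np.asarray, (camera_pos, target, world_up))
    forward = target - camera_pos
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, world_up)
    right = right / np.linalg.norm(right)
    down = np.cross(forward, right)

    rot = np.stack([right, down, forward])   # camera axes expressed as rows
    t = -rot @ camera_pos
    return np.hstack([rot, t[:, None]])      # 3x4 matrix mapping world -> camera

# Hypothetical robot-eye view: 0.6 m above the bench, looking at the part.
extrinsics = look_at_extrinsics(camera_pos=[0.3, -0.4, 0.6], target=[0.0, 0.0, 0.05])
```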
Google's RLDS format — the standard used by Open X-Embodiment, RT-X, Octo, and OpenVLA. Plug directly into your imitation learning pipeline. Zero conversion overhead.
```
// RLDS Episode — Assembly task
{
  "episode": "bracket_assembly_001",
  "steps": 900,                       // 30 s @ 30 fps
  "factory": "msme_pune_047",
  "step_0": {
    "image":      [1920, 1080, 3],    // RGB
    "depth":      [1280, 800],        // hardware stereo
    "body_pose":  [33, 3],            // SMPL
    "left_hand":  [21, 3],            // MANO
    "right_hand": [21, 3],            // MANO
    "objects": {
      "bracket":     [x, y, z, qw, qx, qy, qz],
      "screwdriver": [x, y, z, qw, qx, qy, qz]
    },
    "masks": { ... },                 // SAM 2 per-object
    "action": "REACH",
    "language": "Pick up bracket and align with mounting hole"
  }
}
```
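For consumers of the data, a hedged sketch of iterating such episodes, assuming the dataset is delivered as a standard TFDS/RLDS directory; the path and field names mirror the example above and are illustrative:

```python
import tensorflow_datasets as tfds

# Illustrative path to a delivered RLDS dataset directory.
builder = tfds.builder_from_directory("/data/manudata/bracket_assembly/1.0.0")
episodes = builder.as_dataset(split="train")

for episode in episodes.take(1):
    for step in episode["steps"]:           # nested tf.data.Dataset of steps
        image = step["image"]               # RGB frame
        action = step["action"]             # e.g. "REACH"
        instruction = step["language"]      # natural-language instruction
```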
Seven models working in concert, with edge inference on NVIDIA Jetson and cloud-scale batch processing. The underlying methods were published at CVPR, ECCV, and SIGGRAPH.
CLIFF: Full SMPL mesh with global position, not just relative joint angles.
HaMeR: ViT-H backbone for detailed 3D hand pose through self-occlusions.
FoundationPose: Zero-shot on novel objects from a CAD file or reference photos alone.
SAM 2: Per-pixel masks through full video with occlusion memory.
ActionFormer: Anchor-free temporal transformer that finds action boundaries.
3D Gaussian Splatting: Novel viewpoint synthesis at 100+ FPS. Digital twins for every station.
Multi-layer QA: automated confidence checks, geometric cross-validation, temporal smoothing, and 5% human review on every batch.
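As one example of those layers, temporal smoothing with a confidence gate might look like the following sketch, using a SciPy Savitzky-Golay filter over hand keypoint tracks; the window size and review threshold are illustrative assumptions:

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_and_flag(keypoints, confidence, window=9, polyorder=2, min_conf=0.6):
    """Smooth (T, 21, 3) keypoint tracks over time and flag frames for review.

    Frames whose mean detector confidence falls below `min_conf` are returned
    for human review rather than being silently smoothed over.
    """
    smoothed = savgol_filter(keypoints, window, polyorder, axis=0)
    review_frames = np.where(confidence.mean(axis=1) < min_conf)[0]
    return smoothed, review_frames

# 900 frames of one hand (21 joints) with per-joint confidences.
kps = np.random.rand(900, 21, 3)
conf = np.random.rand(900, 21)
smoothed, review = smooth_and_flag(kps, conf)
```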
A structured look at what different data collection approaches actually deliver to your training pipeline.
| Capability | Teleoperation | Egocentric Capture | Simulation | ManuData |
|---|---|---|---|---|
| Data Source | Lab operators w/ VR | Head-mounted cameras | Software-generated | Multi-cam rigs in real MSME factories |
| Depth Data | ✓ Teleop sensors | △ Monocular estimation | ✓ Perfect (synthetic) | ✓ Hardware stereo + active IR |
| 3D Body Pose | ✗ Not included | ✓ Varies | ✓ Synthetic | ✓ CLIFF — 33 joints, SMPL |
| 3D Hand Tracking | ✗ Controller only | △ Rarely offered | △ Limited | ✓ HaMeR — 21/hand, <8mm |
| 6DoF Object Pose | ✗ Not tracked | ✗ Not offered | ✓ Synthetic | ✓ FoundationPose — <5° |
| Segmentation Masks | ✗ | ✗ | ✓ Synthetic | ✓ SAM 2 — >0.92 IoU |
| Action Labels | Manual (expensive) | ✓ Task labels | Scripted | ✓ ActionFormer — auto |
| 3D Scene Reconstruction | ✗ | ✗ | ✓ Native | ✓ 3DGS — 100+ FPS |
| Real Environments | ✗ Lab only | ✓ Real factories | ✗ Synthetic | ✓ 500+ factories (63M accessible) |
| Task Diversity | Low | Medium | Low (scripted) | Massive — labor-intensive economy |
| Scalability | 100s hrs | 10K–30K hrs | Infinite (synthetic) | 1M+ hours capacity |
| Output Format | Proprietary | Standardized | Varies | RLDS native — RT-X / Octo / OpenVLA |
| Data Provenance | US/EU or China | Varies | N/A | India — geopolitically neutral |
| Cost | $100–500/hr | Not published | Compute only | $50–500/hr (fully processed) |
From raw multi-view video to fully annotated manipulation trajectories with language instructions and force data.
1. Define task requirements, object categories, and format specs.
2. Install capture rigs at MSME partners matched to your use case.
3. Record thousands of manipulation episodes across diverse environments.
4. Run the ML pipeline to extract poses, segmentations, actions, and scene models.
5. Deliver RLDS-formatted data ready for RT-X, Octo, or your custom pipeline.
We'll deliver a pilot dataset at no cost to prove data quality and compatibility with your training stack. Sign a non-binding Letter of Intent and we'll begin your free 1,000-hour pilot: no cost, no commitment, just data to evaluate.