Robot Training Data Infrastructure

Train Your Robots With Real Human Demonstrations

We sit on the world's largest untapped source of manipulation data — 63 million Indian MSMEs where skilled workers still do by hand what robots need to learn. Our ML pipeline turns that footage into RLDS-ready training data.

The ManuData Advantage
63M+
MSMEs across India — the world's largest hands-on manufacturing network
11.7M
Manufacturing Units
300M+
Workers Employed
30%
Of India's GDP
Task Diversity
1
CAPTURE
2
TRACK
3
SEGMENT
4
LOCALIZE
5
ANNOTATE
6
RECONSTRUCT
7
EXPORT RLDS
The Unfair Advantage

63 Million Factories. One Data Pipeline.

While others capture data in labs or strap cameras on a few hundred workers, we have access to the most labor-intensive manufacturing economy on Earth.

India's MSMEs are the world's largest repository of human manipulation skill

India has 63.4 million registered MSMEs — of which 11.7 million are in manufacturing. These aren't automated factories. These are workshops where skilled workers hand-assemble electronics, weld metal, weave textiles, mold plastics, pack goods, sort components, and perform thousands of manipulation tasks every day — by hand.

This is exactly the data that robots need. Not staged demonstrations in clean labs. Not simulated physics. Real humans performing real tasks with real variability — different lighting, different objects, different hand sizes, real mistakes and real recoveries.

Why no one else can replicate this

Other data providers deploy cameras in a handful of partner facilities or rely on egocentric wearables that capture only the wearer's viewpoint with monocular depth estimation. We deploy multi-camera hardware depth rigs across a growing network that taps into India's MSME ecosystem — a sector employing over 300 million people across every conceivable manufacturing task.

The tasks that are still manual here are precisely the tasks robots struggle with most. The scale isn't hundreds of facilities. It's a pathway to millions. That's the ManuData moat.

The Industry Problem

Every Approach Has a Fatal Flaw

The robotics industry is spending billions on data that doesn't scale, doesn't transfer, or doesn't exist.

01 — TELEOPERATION

Expensive & Unscalable

Requires specialized hardware, trained operators, and lab environments. Produces clean data but at $100–500/hour. Most datasets are hundreds of hours, not the millions needed.

$100–500/hr
02 — EGOCENTRIC CAPTURE

Missing Dimensions

Head-mounted cameras capture first-person video at scale, but lack hardware depth, 6DoF object tracking, segmentation masks, and 3D scene reconstruction. Monocular depth can't match stereo hardware accuracy.

Incomplete data
03 — SIMULATION

The Sim-to-Real Gap

Synthetic data is infinitely scalable but suffers from unrealistic physics, limited task diversity, and artifacts that don't transfer to the messy real world. The gap remains the fundamental bottleneck.

Doesn't transfer
Our Solution

The Full Stack. Not Just Video.

Multi-camera hardware depth rigs + a 7-model ML pipeline. We don't just deliver video — we deliver structured training data ready for your robot.

Multi-View Hardware Depth Capture

01

3x Intel RealSense D455 (active IR stereo) + 1x ZED 2i (neural depth) per station. Overhead + dual 45° + hero front views. Hardware-synced to under 1 ms. Real depth, not monocular estimation.
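As an illustration of the sub-millisecond sync claim, a QA pass might verify that all four cameras' frame timestamps fall within tolerance before a frame set is accepted. This is a minimal sketch; `check_sync` is a hypothetical helper, not part of any camera SDK.

```python
def check_sync(timestamps_ms, tolerance_ms=1.0):
    """Return True if all camera timestamps for one frame set fall
    within the given tolerance (hypothetical QA helper)."""
    return max(timestamps_ms) - min(timestamps_ms) <= tolerance_ms

# Four cameras: 3x RealSense D455 + 1x ZED 2i
frame_ts = [10000.12, 10000.34, 10000.05, 10000.71]
print(check_sync(frame_ts))  # True: spread is 0.66 ms
```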

3D Body & Hand Pose

02

CLIFF extracts a 33-joint SMPL body mesh in world coordinates. HaMeR recovers 21 joints per hand at under 8mm error with a ViT-H backbone. Full kinematic chain from spine to fingertips.
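The body-pose error figures quoted throughout (e.g. 52mm MPJPE) follow the standard Mean Per-Joint Position Error metric, which can be sketched in a few lines. This is an illustrative computation, not the production evaluation code.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error, in the same units as the input
    (here: millimetres). pred, gt: (J, 3) arrays of joint positions."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

gt = np.zeros((33, 3))                   # 33 body joints
pred = gt + np.array([30.0, 40.0, 0.0])  # uniform 50 mm offset
print(mpjpe(pred, gt))  # 50.0
```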

6DoF Object Pose Tracking

03

FoundationPose tracks every manipulated object — position + orientation in 3D space. Zero-shot generalization to novel objects with just a CAD file or reference photos. Under 5° rotation error.
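The rotation-error figure is an angular distance between a predicted and a ground-truth orientation, both stored as quaternions (the `[qw, qx, qy, qz]` fields in the export schema). A minimal sketch of that metric, with a hypothetical helper name:

```python
import math

def quat_angle_deg(q1, q2):
    """Angular difference in degrees between two unit quaternions
    (w, x, y, z). The |dot| handles the q / -q double cover."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    dot = min(1.0, dot)  # guard against float rounding above 1
    return math.degrees(2.0 * math.acos(dot))

identity = (1.0, 0.0, 0.0, 0.0)
# A 4-degree rotation about z: q = (cos(2 deg), 0, 0, sin(2 deg))
q = (math.cos(math.radians(2)), 0.0, 0.0, math.sin(math.radians(2)))
print(round(quat_angle_deg(identity, q), 1))  # 4.0
```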

Segmentation & Action Labels

04

SAM 2 provides per-pixel object masks tracked through occlusions. ActionFormer detects temporal boundaries — pick, place, screw, inspect — with 72% mAP. No manual annotation needed.
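The segmentation quality numbers quoted later (>0.92 IoU) use intersection-over-union between predicted and reference masks. A self-contained sketch of that metric, assuming boolean mask arrays:

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-Union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 1.0

a = np.zeros((4, 4), dtype=bool); a[:2, :] = True  # top two rows
b = np.zeros((4, 4), dtype=bool); b[:3, :] = True  # top three rows
print(mask_iou(a, b))  # 8 / 12
```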

3D Scene Reconstruction

05

3D Gaussian Splatting builds photorealistic scene models for novel viewpoint synthesis at 100+ FPS. Generate robot-eye-view training data from any angle. 1000x faster than NeRF.

RLDS Native Export

06

Google's RLDS format — the standard used by Open X-Embodiment, RT-X, Octo, and OpenVLA. Plug directly into your imitation learning pipeline. Zero conversion overhead.

// RLDS Episode — Assembly task
{
  "episode": "bracket_assembly_001",
  "steps": 900,  // 30s @ 30fps
  "factory": "msme_pune_047",

  "step_0": {
    "image":      [1920, 1080, 3],  // RGB
    "depth":      [1280, 800],      // HW stereo
    "body_pose":  [33, 3],          // SMPL
    "left_hand":  [21, 3],          // MANO
    "right_hand": [21, 3],          // MANO
    "objects": {
      "bracket":     [x, y, z, qw, qx, qy, qz],
      "screwdriver": [x, y, z, qw, qx, qy, qz]
    },
    "masks": { ... },  // SAM 2 per-object
    "action": "REACH",
    "language": "Pick up bracket and align with mounting hole"
  }
}
10x
Lower cost than US/EU data collection
100x
Task diversity — still manual in India
500+
Factory partners and growing
1M+
Hours scalable capacity
Technology

State-of-the-Art ML Pipeline

Seven models in concert — edge inference on NVIDIA Jetson, cloud-scale batch processing. Published at CVPR, ECCV, SIGGRAPH.

CLIFF · ECCV 2022

3D Body Pose

Full SMPL mesh with global position — not just relative joint angles.

Keypoints: 33 joints
Error: 52mm MPJPE
Output: SMPL params
HaMeR · CVPR 2024

Hand Reconstruction

ViT-H backbone for detailed 3D hand pose through self-occlusions.

Keypoints: 21 per hand
Error: <8mm
Speed: ~30ms/hand
FoundationPose · CVPR 2024

6DoF Object Pose

Zero-shot on novel objects — CAD or reference photos only.

Rotation: <5°
Translation: 12mm
Speed: 30+ FPS
SAM 2 · Meta 2024

Video Segmentation

Per-pixel masks through full video with occlusion memory.

IoU: >0.92
Speed: 30+ FPS
Data: 600K+ masks
ActionFormer · ECCV 2022

Action Detection

Temporal transformer finds action boundaries — anchor-free.

mAP@0.5: 72%
Type: Transformer
Inference: Real-time
3DGS · SIGGRAPH 2023

Scene Reconstruction

Novel viewpoint synthesis at 100+ FPS. Digital twins for every station.

PSNR: 31.2 dB
Rendering: 100+ FPS
vs NeRF: 1000x faster
Intel RealSense D455 x3
Active IR Stereo · 1280x800 depth
Stereolabs ZED 2i
Neural depth · IP66 · 4MP
NVIDIA Jetson Orin NX
100 TOPS edge compute
$2,500–4,000
Total hardware per station
Quality

Research-Grade Accuracy

Multi-layer QA: automated confidence checks, geometric cross-validation, temporal smoothing, and 5% human review on every batch.
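One simple form the temporal-smoothing pass might take is an exponential moving average over per-frame keypoint coordinates, damping single-frame jitter without discarding motion. This is an illustrative sketch only; the actual QA filters are not specified here.

```python
def ema_smooth(series, alpha=0.3):
    """Exponential moving average over a 1-D sequence of values —
    e.g. one coordinate of one keypoint across frames. Higher alpha
    tracks the signal faster; lower alpha smooths harder."""
    out = [series[0]]
    for x in series[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out

noisy = [0.0, 10.0, 0.0, 10.0]
print(ema_smooth(noisy))  # single-frame spikes damped toward the trend
```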

52
mm MPJPE
Body Pose
7.8
mm Mean Error
Hand Pose
4.2°
Rotation Error
Object Pose
0.93
IoU Score
Segmentation
Comparison

Not All Data Is Equal

A structured look at what different data collection approaches actually deliver to your training pipeline.

| Capability | Teleoperation | Egocentric Capture | Simulation | ManuData |
| --- | --- | --- | --- | --- |
| Data Source | Lab operators w/ VR | Head-mounted cameras | Software-generated | Multi-cam rigs in real MSME factories |
| Depth Data | Teleop sensors | Monocular estimation | Perfect (synthetic) | Hardware stereo + active IR |
| 3D Body Pose | Not included | Varies | Synthetic | CLIFF — 33 joints, SMPL |
| 3D Hand Tracking | Controller only | Rarely offered | Limited | HaMeR — 21/hand, <8mm |
| 6DoF Object Pose | Not tracked | Not offered | Synthetic | FoundationPose — <5° |
| Segmentation Masks | | | Synthetic | SAM 2 — >0.92 IoU |
| Action Labels | Manual (expensive) | Task labels | Scripted | ActionFormer — auto |
| 3D Scene Reconstruction | | | Native | 3DGS — 100+ FPS |
| Real Environments | Lab only | Real factories | Synthetic | 500+ factories (63M accessible) |
| Task Diversity | Low | Medium | Low (scripted) | Massive — labor-intensive economy |
| Scalability | 100s hrs | 10K–30K hrs | Infinite (synthetic) | 1M+ hours capacity |
| Output Format | Proprietary | Standardized | Varies | RLDS native — RT-X / Octo / OpenVLA |
| Data Provenance | US/EU or China | Varies | N/A | India — geopolitically neutral |
| Cost | $100–500/hr | Not published | Compute only | $50–500/hr (fully processed) |
Data Tiers

Choose Your Data Depth

From raw multi-view video to fully annotated manipulation trajectories with language instructions and force data.

Bronze
$50
per hour
  • Multi-view RGB video (1080p, 30fps)
  • 33-keypoint body pose (SMPL)
  • Camera calibration data
  • Basic metadata
Get Started
Silver
$150
per hour
  • Everything in Bronze
  • Hardware depth maps (stereo + IR)
  • 21-keypoint hand tracking per hand
  • Multi-view fusion
Get Started
Platinum
$500
per hour
  • Everything in Silver
  • Natural language instructions
  • Force sensing data
  • Custom task taxonomies
  • Dedicated factory capacity
Get Started
Integration

From Discovery to Data in 5 Weeks

Week 1

Discovery

Define task requirements, object categories, and format specs.

Week 2

Deployment

Install capture rigs at MSME partners matched to your use case.

Week 2–3

Capture

Record thousands of manipulation episodes across diverse environments.

Week 4

Processing

ML pipeline extracts poses, segmentations, actions, and scene models.

Week 5

Delivery

RLDS-formatted data ready for RT-X, Octo, or your custom pipeline.

Start With 1,000 Hours Free

We'll deliver a pilot dataset at no cost to prove data quality and compatibility with your training stack. Non-binding LOI — no commitment.

Letter of Intent

Let's Build Your Training Dataset

Sign a non-binding Letter of Intent and we'll begin your free 1,000-hour pilot. No cost, no commitment — just data to evaluate.

  • 1,000 hours of processed robot training data at no cost
  • Multi-view RGB + hardware depth + 3D body and hand pose
  • 6DoF object tracking + semantic segmentation masks
  • Temporal action labels + natural language descriptions
  • RLDS / Open X-Embodiment compatible format
  • Delivery within 90 days of LOI execution
  • You retain full usage rights — ManuData retains pipeline IP only
  • Privacy-first: face blurring, data sovereignty, factory-specific NDAs