Robot Training Data Infrastructure

Train Your Robots With Real Human Demonstrations

We sit on the world's largest untapped source of manipulation data — 63 million Indian MSMEs where skilled workers still do by hand what robots need to learn. Our ML pipeline turns that footage into RLDS-ready training data.

The ManuData Advantage
63M+
MSMEs across India — the world's largest hands-on manufacturing network
11.7M
Manufacturing Units
300M+
Workers Employed
30%
Of India's GDP
Task Diversity
1
CAPTURE
2
TRACK
3
SEGMENT
4
LOCALIZE
5
ANNOTATE
6
RECONSTRUCT
7
EXPORT RLDS
The Unfair Advantage

63 Million Factories. One Data Pipeline.

While others capture data in labs or strap cameras on a few hundred workers, we have access to the most labor-intensive manufacturing economy on Earth.

India's MSMEs are the world's largest repository of human manipulation skill

India has 63.4 million registered MSMEs — of which 11.7 million are in manufacturing. These aren't automated factories. These are workshops where skilled workers hand-assemble electronics, weld metal, weave textiles, mold plastics, pack goods, sort components, and perform thousands of manipulation tasks every day — by hand.

This is exactly the data that robots need. Not staged demonstrations in clean labs. Not simulated physics. Real humans performing real tasks with real variability — different lighting, different objects, different hand sizes, real mistakes and real recoveries.

Why no one else can replicate this

Other data providers deploy cameras in a handful of partner facilities or rely on egocentric wearables that capture only the wearer's viewpoint with monocular depth estimation. We deploy multi-camera hardware depth rigs across a growing network that taps into India's MSME ecosystem — a sector employing over 300 million people across every conceivable manufacturing task.

The tasks that are still manual here are precisely the tasks robots struggle with most. The scale isn't hundreds of facilities. It's a pathway to millions. That's the ManuData moat.

The Industry Problem

Every Approach Has a Fatal Flaw

The robotics industry is spending billions on data that doesn't scale, doesn't transfer, or doesn't exist.

01 — TELEOPERATION

Expensive & Unscalable

Requires specialized hardware, trained operators, and lab environments. Produces clean data but at $100–500/hour. Most datasets are hundreds of hours, not the millions needed.

$100–500/hr
02 — EGOCENTRIC CAPTURE

Missing Dimensions

Head-mounted cameras capture first-person video at scale, but lack hardware depth, 6DoF object tracking, segmentation masks, and 3D scene reconstruction. Monocular depth can't match stereo hardware accuracy.

Incomplete data
03 — SIMULATION

The Sim-to-Real Gap

Synthetic data is infinitely scalable but suffers from unrealistic physics, limited task diversity, and artifacts that don't transfer to the messy real world. The gap remains the fundamental bottleneck.

Doesn't transfer
Our Solution

The Full Stack. Not Just Video.

Multi-camera hardware depth rigs + a 7-model ML pipeline. We don't just deliver video — we deliver structured training data ready for your robot.

Multi-View Hardware Depth Capture

01

3x Intel RealSense D455 (active IR stereo) + 1x ZED 2i (neural depth) per station. Overhead + dual 45° + hero front views. Hardware-synced to under 1 ms. Real depth, not monocular estimation.
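As an illustration of the sub-millisecond sync claim, a QA pass might verify that all four cameras' frame timestamps fall within tolerance before a frame set is accepted. This is a minimal sketch; `check_sync` is a hypothetical helper, not part of any camera SDK.

```python
def check_sync(timestamps_ms, tolerance_ms=1.0):
    """Return True if all camera timestamps for one frame set fall
    within the given tolerance (hypothetical QA helper)."""
    return max(timestamps_ms) - min(timestamps_ms) <= tolerance_ms

# Four cameras: 3x RealSense D455 + 1x ZED 2i
frame_ts = [10000.12, 10000.34, 10000.05, 10000.71]
print(check_sync(frame_ts))  # True: spread is 0.66 ms
```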

3D Body & Hand Pose

02

CLIFF extracts a 33-joint SMPL body mesh in world coordinates. HaMeR recovers 21 joints per hand at under 8mm error with a ViT-H backbone. Full kinematic chain from spine to fingertips.
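The body-pose error figures quoted throughout (e.g. 52mm MPJPE) follow the standard Mean Per-Joint Position Error metric, which can be sketched in a few lines. This is an illustrative computation, not the production evaluation code.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error, in the same units as the input
    (here: millimetres). pred, gt: (J, 3) arrays of joint positions."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

gt = np.zeros((33, 3))                   # 33 body joints
pred = gt + np.array([30.0, 40.0, 0.0])  # uniform 50 mm offset
print(mpjpe(pred, gt))  # 50.0
```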

6DoF Object Pose Tracking

03

FoundationPose tracks every manipulated object — position + orientation in 3D space. Zero-shot generalization to novel objects with just a CAD file or reference photos. Under 5° rotation error.
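The rotation-error figure is an angular distance between a predicted and a ground-truth orientation, both stored as quaternions (the `[qw, qx, qy, qz]` fields in the export schema). A minimal sketch of that metric, with a hypothetical helper name:

```python
import math

def quat_angle_deg(q1, q2):
    """Angular difference in degrees between two unit quaternions
    (w, x, y, z). The |dot| handles the q / -q double cover."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    dot = min(1.0, dot)  # guard against float rounding above 1
    return math.degrees(2.0 * math.acos(dot))

identity = (1.0, 0.0, 0.0, 0.0)
# A 4-degree rotation about z: q = (cos(2 deg), 0, 0, sin(2 deg))
q = (math.cos(math.radians(2)), 0.0, 0.0, math.sin(math.radians(2)))
print(round(quat_angle_deg(identity, q), 1))  # 4.0
```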

Segmentation & Action Labels

04

SAM 2 provides per-pixel object masks tracked through occlusions. ActionFormer detects temporal boundaries — pick, place, screw, inspect — with 72% mAP. No manual annotation needed.
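The segmentation quality numbers quoted later (>0.92 IoU) use intersection-over-union between predicted and reference masks. A self-contained sketch of that metric, assuming boolean mask arrays:

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-Union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 1.0

a = np.zeros((4, 4), dtype=bool); a[:2, :] = True  # top two rows
b = np.zeros((4, 4), dtype=bool); b[:3, :] = True  # top three rows
print(mask_iou(a, b))  # 8 / 12
```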

3D Scene Reconstruction

05

3D Gaussian Splatting builds photorealistic scene models for novel viewpoint synthesis at 100+ FPS. Generate robot-eye-view training data from any angle. 1000x faster than NeRF.

RLDS Native Export

06

Google's RLDS format — the standard used by Open X-Embodiment, RT-X, Octo, and OpenVLA. Plug directly into your imitation learning pipeline. Zero conversion overhead.

// RLDS Episode — Assembly task
{
  "episode": "bracket_assembly_001",
  "steps": 900,  // 30s @ 30fps
  "factory": "msme_pune_047",

  "step_0": {
    "image":      [1920, 1080, 3],  // RGB
    "depth":      [1280, 800],      // HW stereo
    "body_pose":  [33, 3],          // SMPL
    "left_hand":  [21, 3],          // MANO
    "right_hand": [21, 3],          // MANO
    "objects": {
      "bracket":     [x, y, z, qw, qx, qy, qz],
      "screwdriver": [x, y, z, qw, qx, qy, qz]
    },
    "masks": { ... },  // SAM 2 per-object
    "action": "REACH",
    "language": "Pick up bracket and align with mounting hole"
  }
}
10x
Lower cost than US/EU data collection
100x
Task diversity — still manual in India
500+
Factory partners and growing
1M+
Hours scalable capacity
Technology

State-of-the-Art ML Pipeline

Seven models in concert — edge inference on NVIDIA Jetson, cloud-scale batch processing. Published at CVPR, ECCV, SIGGRAPH.

CLIFF · ECCV 2022

3D Body Pose

Full SMPL mesh with global position — not just relative joint angles.

Keypoints: 33 joints
Error: 52mm MPJPE
Output: SMPL params
HaMeR · CVPR 2024

Hand Reconstruction

ViT-H backbone for detailed 3D hand pose through self-occlusions.

Keypoints: 21 per hand
Error: <8mm
Speed: ~30ms/hand
FoundationPose · CVPR 2024

6DoF Object Pose

Zero-shot on novel objects — CAD or reference photos only.

Rotation: <5°
Translation: 12mm
Speed: 30+ FPS
SAM 2 · Meta 2024

Video Segmentation

Per-pixel masks through full video with occlusion memory.

IoU: >0.92
Speed: 30+ FPS
Data: 600K+ masks
ActionFormer · ECCV 2022

Action Detection

Temporal transformer finds action boundaries — anchor-free.

mAP@0.5: 72%
Type: Transformer
Inference: Real-time
3DGS · SIGGRAPH 2023

Scene Reconstruction

Novel viewpoint synthesis at 100+ FPS. Digital twins for every station.

PSNR: 31.2 dB
Rendering: 100+ FPS
vs NeRF: 1000x faster
Intel RealSense D455 x3
Active IR Stereo · 1280x800 depth
Stereolabs ZED 2i
Neural depth · IP66 · 4MP
NVIDIA Jetson Orin NX
100 TOPS edge compute
$2,500–4,000
Total hardware per station
Quality

Research-Grade Accuracy

Multi-layer QA: automated confidence checks, geometric cross-validation, temporal smoothing, and 5% human review on every batch.
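One simple form the temporal-smoothing pass might take is an exponential moving average over per-frame keypoint coordinates, damping single-frame jitter without discarding motion. This is an illustrative sketch only; the actual QA filters are not specified here.

```python
def ema_smooth(series, alpha=0.3):
    """Exponential moving average over a 1-D sequence of values —
    e.g. one coordinate of one keypoint across frames. Higher alpha
    tracks the signal faster; lower alpha smooths harder."""
    out = [series[0]]
    for x in series[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out

noisy = [0.0, 10.0, 0.0, 10.0]
print(ema_smooth(noisy))  # single-frame spikes damped toward the trend
```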

52
mm MPJPE
Body Pose
7.8
mm Mean Error
Hand Pose
4.2°
Rotation Error
Object Pose
0.93
IoU Score
Segmentation
Comparison

Not All Data Is Equal

A structured look at what different data collection approaches actually deliver to your training pipeline.

| Capability | Teleoperation | Egocentric Capture | Simulation | ManuData |
| --- | --- | --- | --- | --- |
| Data Source | Lab operators w/ VR | Head-mounted cameras | Software-generated | Multi-cam rigs in real MSME factories |
| Depth Data | Teleop sensors | Monocular estimation | Perfect (synthetic) | Hardware stereo + active IR |
| 3D Body Pose | Not included | Varies | Synthetic | CLIFF — 33 joints, SMPL |
| 3D Hand Tracking | Controller only | Rarely offered | Limited | HaMeR — 21/hand, <8mm |
| 6DoF Object Pose | Not tracked | Not offered | Synthetic | FoundationPose — <5° |
| Segmentation Masks | | | Synthetic | SAM 2 — >0.92 IoU |
| Action Labels | Manual (expensive) | Task labels | Scripted | ActionFormer — auto |
| 3D Scene Reconstruction | | | Native | 3DGS — 100+ FPS |
| Real Environments | Lab only | Real factories | Synthetic | 500+ factories (63M accessible) |
| Task Diversity | Low | Medium | Low (scripted) | Massive — labor-intensive economy |
| Scalability | 100s hrs | 10K–30K hrs | Infinite (synthetic) | 1M+ hours capacity |
| Output Format | Proprietary | Standardized | Varies | RLDS native — RT-X / Octo / OpenVLA |
| Data Provenance | US/EU or China | Varies | N/A | India — geopolitically neutral |
| Cost | $100–500/hr | Not published | Compute only | $50–500/hr (fully processed) |
Data Tiers

Choose Your Data Depth

From raw multi-view video to fully annotated manipulation trajectories with language instructions and force data.

Bronze
$50
per hour
  • Multi-view RGB video (1080p, 30fps)
  • 33-keypoint body pose (SMPL)
  • Camera calibration data
  • Basic metadata
Get Started
Silver
$150
per hour
  • Everything in Bronze
  • Hardware depth maps (stereo + IR)
  • 21-keypoint hand tracking per hand
  • Multi-view fusion
Get Started
Platinum
$500
per hour
  • Everything in Silver
  • Natural language instructions
  • Force sensing data
  • Custom task taxonomies
  • Dedicated factory capacity
Get Started
Integration

From Discovery to Data in 5 Weeks

Week 1

Discovery

Define task requirements, object categories, and format specs.

Week 2

Deployment

Install capture rigs at MSME partners matched to your use case.

Week 2–3

Capture

Record thousands of manipulation episodes across diverse environments.

Week 4

Processing

ML pipeline extracts poses, segmentations, actions, and scene models.

Week 5

Delivery

RLDS-formatted data ready for RT-X, Octo, or your custom pipeline.

Start With 1,000 Hours Free

We'll deliver a pilot dataset at no cost to prove data quality and compatibility with your training stack. Non-binding LOI — no commitment.

Letter of Intent

Let's Build Your Training Dataset

Sign a non-binding Letter of Intent and we'll begin your free 1,000-hour pilot. No cost, no commitment — just data to evaluate.

  • 1,000 hours of processed robot training data at no cost
  • Multi-view RGB + hardware depth + 3D body and hand pose
  • 6DoF object tracking + semantic segmentation masks
  • Temporal action labels + natural language descriptions
  • RLDS / Open X-Embodiment compatible format
  • Delivery within 90 days of LOI execution
  • You retain full usage rights — ManuData retains pipeline IP only
  • Privacy-first: face blurring, data sovereignty, factory-specific NDAs