About Synthetic Data

Curriculum Learning

In robotics, control policies are typically trained on simple tasks first — like balancing — before progressing to more complex challenges such as walking or stair climbing. This approach, known as curriculum learning, helps models learn more effectively. Yet in computer vision, it's rarely applied, largely because it's hard to define how “difficult” an image is for a model to interpret.

Our datasets change that. Every image-label pair includes rich metadata that captures scene complexity, such as object occlusion, position and orientation, and lighting conditions, making it easy to structure a learning curriculum. Want to begin with clear, unobstructed views and gradually introduce clutter and occlusion? Done. Our synthetic data makes curriculum learning in vision not only possible, but practical.
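As a minimal sketch of the idea, per-sample metadata can be collapsed into a difficulty score and used to order training easy-to-hard. The field names (occlusion, clutter, lighting_variance) and the weights below are illustrative assumptions, not our actual schema:

```python
# Hypothetical metadata schema: each sample carries scene-complexity fields.
# Field names and weights are assumptions for illustration only.
samples = [
    {"image": "img_001.png", "occlusion": 0.05, "clutter": 2, "lighting_variance": 0.1},
    {"image": "img_002.png", "occlusion": 0.60, "clutter": 9, "lighting_variance": 0.8},
    {"image": "img_003.png", "occlusion": 0.20, "clutter": 5, "lighting_variance": 0.4},
]

def difficulty(sample):
    # Weighted score: heavier occlusion and clutter mean a harder image.
    return (0.5 * sample["occlusion"]
            + 0.3 * (sample["clutter"] / 10)
            + 0.2 * sample["lighting_variance"])

# Order the dataset easy-to-hard, then split into curriculum stages.
ordered = sorted(samples, key=difficulty)
easy = [s for s in ordered if difficulty(s) < 0.3]   # start training here
hard = [s for s in ordered if difficulty(s) >= 0.3]  # introduce later
```

In practice the same score can drive a staged sampler in your training loop, adding harder strata as validation accuracy plateaus.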

Domain Randomization

To build models that generalize to the real world, it's critical to expose them to wide variations during training — a practice known as domain randomization. With synthetic data, you have full control: vary lighting, textures, object placement, camera angles, even add fog, dust, or simulated wear.

Want to train for warehouse conditions with poor lighting and dust buildup? Simulate it. Need your model to handle objects flipped, stacked, or partially occluded? No problem. Our datasets retain all randomization parameters, so you can filter, analyze, or subset based on them before and after training.
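A simple sketch of what this looks like in code: draw a randomized scene configuration per sample, keep the parameters alongside the rendered image, and filter subsets after the fact. The parameter names and ranges below are illustrative assumptions, not values from our pipeline:

```python
import random

def sample_scene_params(rng):
    # Draw one randomized scene configuration.
    # Names and ranges are illustrative assumptions, not our pipeline's.
    return {
        "light_intensity": rng.uniform(0.2, 1.5),  # dim warehouse to bright daylight
        "camera_yaw_deg": rng.uniform(-180.0, 180.0),
        "fog_density": rng.uniform(0.0, 0.3),
        "texture_id": rng.randrange(100),
        "n_occluders": rng.randint(0, 5),
    }

rng = random.Random(42)  # fixed seed makes the dataset reproducible
scenes = [sample_scene_params(rng) for _ in range(1000)]

# Because every parameter is retained with its sample, you can select a
# subset after the fact, e.g. only low-light, dusty scenes:
low_light = [p for p in scenes
             if p["light_intensity"] < 0.5 and p["fog_density"] > 0.2]
```

Retaining the full parameter record is what makes post-hoc analysis possible: if the model fails on real foggy images, you can check how much of that regime the training set actually covered.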

Related Papers

Synthetic Image Data for Deep Learning

Abstract—Realistic synthetic image data rendered from 3D models can be used to augment image sets and train image classification and semantic segmentation models. In this work, we explore how high quality physically-based rendering and domain randomization can efficiently create a large synthetic dataset based on production 3D CAD models of a real vehicle. We use this dataset to quantify the effectiveness of synthetic augmentation using U-net and Double-U-net models. We found that, for this domain, synthetic images were an effective technique for augmenting limited sets of real training data. We observed that models trained on purely synthetic images had a very low mean prediction IoU on real validation images. We also observed that adding even very small amounts of real images to a synthetic dataset greatly improved accuracy, and that models trained on datasets augmented with synthetic images were more accurate than those trained on real images alone. Finally, we found that in use cases that benefit from incremental training or model specialization, pretraining a base model on synthetic images provided a sizeable reduction in the training cost of transfer learning, allowing up to 90% of the model training to be front-loaded.

Domain randomization for transferring deep neural networks from simulation to the real world

Abstract—Bridging the ‘reality gap’ that separates simulated robotics from experiments on hardware could accelerate robotic research through improved data availability. This paper explores domain randomization, a simple technique for training models on simulated images that transfer to real images by randomizing rendering in the simulator. With enough variability in the simulator, the real world may appear to the model as just another variation. We focus on the task of object localization, which is a stepping stone to general robotic manipulation skills. We find that it is possible to train a real-world object detector that is accurate to 1.5 cm and robust to distractors and partial occlusions using only data from a simulator with non-realistic random textures. To demonstrate the capabilities of our detectors, we show they can be used to perform grasping in a cluttered environment. To our knowledge, this is the first successful transfer of a deep neural network trained only on simulated RGB images (without pre-training on real images) to the real world for the purpose of robotic control.

About us

We’re a team of engineers and roboticists with backgrounds in perception, control, mechanical design, and manufacturing. Time and again, we hit the same wall: the AI and reinforcement learning tools were ready to deploy, but the data wasn’t. Public datasets rarely fit the task, and collecting and labeling real-world data was slow, expensive, and painful.

So we built our own pipeline, one that generates exactly the data you need, in the context you need it.

Take your computer vision capabilities to new heights