🔬 Benchmarking Gr00t & Sim-to-Real on the SO100 Arm

Watch the SO100 arm perform a red-cube picking task.

“Open-source control models promise plug-and-play robotics — but do they really work off the shelf?”

At DL-RL, I built an end-to-end pipeline to answer that exact question. We used the SO100 robotic arm, large synthetic datasets generated in Isaac Sim, and extensive fine-tuning and evaluation to measure how well open control models (e.g., Gr00t) actually perform in real-world deployment. The short answer: they can be powerful, but only if you understand how they were trained and handle fine-tuning, data, and evaluation carefully.

Perfect for:

Robotics researchers studying open-source action models
Teams evaluating open-source control models before production use
Anyone building reproducible benchmarks and practical training guides

Why this work matters

🔎 Benchmarking over hype

Many recently released control models are marketed as “off-the-shelf” solutions. Our work shows that realistic deployment requires more than downloading a checkpoint:

Hyperparameter sensitivity. Getting robust behavior often depends heavily on fine-tuning choices (learning rate schedules, batch sizes, augmentation, regularization, etc.). Small differences in training recipe can lead to large differences in real-world performance.
Dataset coverage matters. Models generalize well only when the training distribution reasonably covers the target tasks and edge cases. Rare motions, specific grasps, or unique environment lighting quickly reveal gaps.
Sim-to-real nuances. High-fidelity simulation reduces the gap but does not eliminate it; evaluation on hardware is essential. We run closed-loop tests to catch failure modes that never appear in sim.
Reproducibility is non-negotiable. To make benchmarking useful, we must publish training configs, random seeds, evaluation scripts, and dataset generation code — not just numbers.
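
One way to keep training configs, seeds, and dataset versions tied together is to write them into a single manifest with a content fingerprint. The sketch below is a minimal illustration using only the Python standard library, not the project's actual tooling: the function names, the config keys, and the `dataset_revision` value are all hypothetical.

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """Return a stable SHA-256 hex digest of a JSON-serializable object."""
    # Sorting keys makes the digest independent of dict insertion order.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(config: dict, seed: int, dataset_revision: str) -> dict:
    """Bundle what is needed to re-run a training job into one record."""
    body = {
        "config": config,
        "seed": seed,
        "dataset_revision": dataset_revision,
    }
    # The fingerprint changes whenever the config, seed, or dataset changes.
    return {**body, "fingerprint": fingerprint(body)}


# Hypothetical values, for illustration only.
example_config = {"learning_rate": 1e-4, "batch_size": 32, "augment": True}
manifest = build_manifest(example_config, seed=0, dataset_revision="v1")
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Storing the fingerprint next to every reported number makes it easy to check later which recipe produced which result.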

Because of these realities, we treat this project as a benchmarking and transparency effort: not only to build a working system, but to document what actually works, what fails, and why.

Lessons & insights (high level)

Always version datasets + generation scripts + training configs together — models are meaningless without their data and recipe.
Run ablation studies on key hyperparameters and augmentations; some “default” settings break under real hardware noise.
Track metrics beyond task success (e.g., stability, recovery behavior, variance across seeds) to surface brittleness.
Continuous dataset expansion (captured real imagery + sim variations) materially improves robustness when incorporated into fine-tuning.
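
As a rough illustration of tracking variance across seeds, the sketch below summarizes per-seed success rates from closed-loop trial outcomes. The function name and the trial data are made up for the example and do not come from our experiments.

```python
from statistics import mean, stdev


def summarize_across_seeds(outcomes_by_seed: dict) -> dict:
    """Summarize success rates given {seed: [True/False per trial]}."""
    if not outcomes_by_seed:
        raise ValueError("need at least one seed")
    rates = {}
    for seed, outcomes in outcomes_by_seed.items():
        if not outcomes:
            raise ValueError(f"seed {seed} has no trials")
        rates[seed] = sum(outcomes) / len(outcomes)
    values = list(rates.values())
    return {
        "per_seed": rates,
        "mean": mean(values),
        # Sample standard deviation; undefined for a single seed, so use 0.0.
        "std": stdev(values) if len(values) > 1 else 0.0,
        # The worst seed often says more about brittleness than the mean.
        "worst": min(values),
    }


# Hypothetical outcomes: three seeds, ten hardware trials each.
example = {
    0: [True] * 9 + [False],
    1: [True] * 10,
    2: [True] * 8 + [False] * 2,
}
print(summarize_across_seeds(example))
```

Reporting the spread and the worst seed alongside the mean helps distinguish a recipe that is reliably good from one that is good on average.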

Achievements & deliverables

Generated and published large-scale synthetic datasets for the SO100 arm on Hugging Face, versioned for reproducibility and ongoing community use.
Fine-tuned Gr00t on combined simulated + real data, then deployed and evaluated on the physical SO100 arm. Closed-loop experiments achieved ~90–95% success rates on pick and pick-and-place tasks in our testbed.
Built a complete benchmarking suite (data generation → training → hardware evaluation) and documented the training recipes, hyperparameter sweeps, and failure cases.
Commitment to publication: we will release a detailed benchmarking report (code, configs, logs, and evaluation scripts) so the community knows what to expect from open control models and how to reproduce and improve on our results.
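
A success rate measured on a finite number of hardware trials carries uncertainty, so it helps to report an interval next to the point estimate. The sketch below computes a Wilson score interval with the Python standard library. The trial counts in the example (18 successes out of 20) are hypothetical and are not the counts behind the results above.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials**2)) / denom
    # Clamp to [0, 1] to guard against floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical example: 18 successful grasps out of 20 closed-loop trials.
low, high = wilson_interval(18, 20)
print(f"success rate 0.90, 95% interval [{low:.2f}, {high:.2f}]")
```

With only 20 trials the interval is wide (about 0.70 to 0.97), which is a reminder that small evaluation runs cannot separate a 90% policy from a 95% one.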

How to access our work

Datasets and preliminary artifacts are available on Hugging Face.
The full benchmarking report, training configs, and evaluation scripts will be published alongside an open repository when the manuscript is released.

Our goal is to give researchers clear, reproducible guidance for using and evaluating open control models in real robots.