Robotics & Physical AI medium · first-party

Qwen-RobotManip

Alibaba's first robotics models include one manipulation generalist that topped a third-party real-robot leaderboard — beating rivals that field a separate tuned model per task.

repo Alibaba / Qwen · 2 min read · originally announced 16 Jun 2026

Alibaba's Qwen team shipped three robotics models at once — one for moving through space, one for handling objects, one for predicting what happens next on video — all built on the same vision-language model that already powers its chatbots. The manipulation model is the headline: a single set of weights that drives 15 different robot bodies, from single arms to humanoids, and on RoboChallenge Table30 — an independent benchmark run on real hardware across four robot platforms — it finished first among generalists, roughly a fifth ahead of the previous best.

No proprietary robot fleet: about 38,000 hours of entirely public data, gathered from open datasets and human video, was enough to top a real-robot leaderboard against models trained on private fleets.

What makes that placement unusual is who it beat and how. The leading Western embodied models from NVIDIA and Physical Intelligence post their top numbers by fine-tuning a fresh specialist for each task; Qwen entered one model for everything. And it learned entirely from public data — open robot datasets, egocentric human video, synthetic demonstrations — about 38,000 hours of it, with no proprietary robot fleet behind it. The team's wager, stated in the report's title, is that the bottleneck in robot learning isn't collecting more data but aligning the heterogeneous data already lying around into one frame.

The most candid finding is buried in the same report: on standard in-distribution tests, a model with no robotics pretraining at all scored about the same as a fully pretrained one, so the benchmarks everyone cites barely measure the thing they claim to measure. The team pivoted to out-of-distribution evaluation to find any signal. That a maker topping the leaderboards will also tell you the leaderboards are half theater is the more useful disclosure here. If the open-data thesis holds, hardware makers get a software brain they can adopt without first building a data pipeline, the step that has gated everyone else.

Want to try it?

Two of the three models are public code at github.com/QwenLM/Qwen-VLA — start with the RobotManip repo and its technical report, which is where the 'alignment unlocks scale' training recipe is spelled out.


The lenses

Novelty 4

Impact · breadth 3

Impact · depth 4

Actionable 4

Substance 4

Hype 2

The facts

What ships: Two of the three models (manipulation, navigation) release as public code on GitHub; the video world-model is blog-only

Training data: ~38,000 hours, entirely public (open robot datasets, human video, synthetic demos); no proprietary fleet

Result: #1 generalist on the third-party RoboChallenge Table30 real-robot benchmark, one model across 15 robot platforms

Concepts

Vision-language model · AI benchmarks · World model · Humanoid robot

