Qwen-RobotManip
Alibaba's first robotics models include one manipulation generalist that topped a third-party real-robot leaderboard — beating rivals that field a separate tuned model per task.
Alibaba's Qwen team shipped three robotics models at once — one for moving through space, one for handling objects, one for predicting what happens next on video — all built on the same vision-language model that already powers its chatbots. The manipulation model is the headline: a single set of weights that drives 15 different robot bodies, from single arms to humanoids, and on RoboChallenge Table30 — an independent benchmark run on real hardware across four robot platforms — it finished first among generalists, roughly a fifth ahead of the previous best.
No proprietary robot fleet: about 38,000 hours of public data, scraped from open datasets and human video, was enough to top a real-robot leaderboard against models trained on private fleets.
What makes that placement unusual is who it beat and how. The leading Western embodied models from NVIDIA and Physical Intelligence post their top numbers by fine-tuning a fresh specialist for each task; Qwen entered one model for everything. And it learned entirely from public data — open robot datasets, egocentric human video, synthetic demonstrations — about 38,000 hours of it, with no proprietary robot fleet behind it. The team's wager, stated in the report's title, is that the bottleneck in robot learning isn't collecting more data but aligning the heterogeneous data already lying around into one frame.
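To make "one frame" concrete, here is a minimal sketch of one common way to put action data from different robot bodies into a single fixed-width format: zero-pad each action vector and keep a mask of which slots are real. This is an illustration only, not the recipe from the Qwen report; the function name `to_shared_frame`, the width `SHARED_DIM = 16`, and the sample vectors are all hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical shared action width; not a value from the Qwen report.
SHARED_DIM = 16


def to_shared_frame(
    action: Sequence[float], shared_dim: int = SHARED_DIM
) -> Tuple[List[float], List[int]]:
    """Zero-pad an embodiment-specific action vector to a fixed width.

    Returns (padded, mask), where mask is 1 for slots that carry real
    data and 0 for padding, so a training loss can ignore the padding.
    """
    if len(action) > shared_dim:
        raise ValueError("action is wider than the shared frame")
    pad = shared_dim - len(action)
    padded = [float(x) for x in action] + [0.0] * pad
    mask = [1] * len(action) + [0] * pad
    return padded, mask


# Made-up examples: a 7-value single-arm action and a 14-value humanoid action.
single_arm = [0.1, -0.2, 0.3, 0.0, 0.5, -0.1, 1.0]
humanoid = [0.05 * i for i in range(14)]

for name, action in (("single_arm", single_arm), ("humanoid", humanoid)):
    padded, mask = to_shared_frame(action)
    print(name, len(padded), sum(mask))
```

Both vectors come out the same width, so one model can consume them in a single batch, and the mask records how many slots each body actually uses. Real alignment work also has to reconcile coordinate conventions, control rates, and camera setups, which this sketch does not attempt.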
The most candid finding is buried in the same report: on standard in-distribution tests, a model with no robotics pretraining at all scored about the same as a fully pretrained one, so the benchmarks everyone cites barely measure what they claim to measure. The team pivoted to out-of-distribution evaluation to find any signal. The more useful disclosure here is that a maker topping the leaderboards will also tell you the leaderboards are half theater. If the open-data thesis holds, hardware makers get a software brain they can adopt without first building a data pipeline, the step that has gated everyone else.
Code for two of the three models is public at github.com/QwenLM/Qwen-VLA. Start with the RobotManip repo and its technical report, where the 'alignment unlocks scale' training recipe is spelled out.