Two New Papers Show How to Fix Robot Policies Without Starting Over
FlowPRO and EVE tackle the same problem from opposite directions: making robot learning actually work outside the lab.
Training a robot to do something in a lab is one thing. Getting that same robot to keep working when reality gets messy is, honestly, where most approaches fall apart.
Two papers dropped this week that attack this problem from completely different angles, and I think they're worth looking at together because they reveal something interesting about where the field is headed.
The core problem both papers are solving is what happens after you've trained a vision-language-action model. These VLAs (the things that let robots see, understand language commands, and actually move) work pretty well in controlled settings. But deploy them on a real robot, and they start failing in ways that are expensive to fix. The traditional answer has been to collect more data and retrain, which is slow and costly.
FlowPRO appeared on arXiv this week, and it takes what I'd call the "teach from corrections" approach. The setup is clever: a human operator watches the robot attempt a task and, when it screws up, intervenes with a correction. That single correction yields a natural data pair: the wrong thing the robot was about to do and the right thing it should have done instead.
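To make the "one correction, one pair" idea concrete, here is a minimal sketch of how an intervention could be logged as a preference pair. Everything in it is my own illustration: the `PreferencePair` class, the `pair_from_intervention` helper, and the toy numbers are hypothetical, not FlowPRO's actual data format.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class PreferencePair:
    """One human correction, stored as a (rejected, chosen) pair.

    Hypothetical structure for illustration; not taken from the paper.
    """
    observation: Tuple[float, ...]      # what the robot saw at intervention time
    rejected_action: Tuple[float, ...]  # what the policy was about to do
    chosen_action: Tuple[float, ...]    # what the human did instead


def pair_from_intervention(
    observation: Sequence[float],
    policy_action: Sequence[float],
    human_action: Sequence[float],
) -> PreferencePair:
    """Turn a single intervention into a single preference pair."""
    return PreferencePair(
        observation=tuple(observation),
        rejected_action=tuple(policy_action),
        chosen_action=tuple(human_action),
    )


# Toy example: the operator overrides one planned action.
pairs: List[PreferencePair] = []
pairs.append(
    pair_from_intervention(
        observation=[0.1, 0.4],
        policy_action=[0.9, -0.2],
        human_action=[0.3, 0.1],
    )
)
```

The point of the sketch is only that no reward label is stored anywhere: the pair itself is the supervision signal.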
What makes this interesting is that it's reward-free. If you've followed RL in robotics at all, you know that designing reward functions for real-world tasks is, tbh, kind of a nightmare. FlowPRO sidesteps this entirely by using preference optimization. The robot learns that trajectory A (what the human did) is better than trajectory B (what it was about to do) without needing to assign specific numerical rewards.
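For a rough feel of what "A is better than B" looks like as a training signal, here is a sketch of a generic pairwise (Bradley-Terry style) preference loss. This is an assumption on my part: the article's source does not say FlowPRO uses this exact objective, and `preference_loss`, the `beta` scale, and the scalar "scores" are hypothetical stand-ins for whatever the model assigns to each trajectory.

```python
import math


def preference_loss(score_chosen: float, score_rejected: float, beta: float = 1.0) -> float:
    """Generic pairwise preference loss: -log(sigmoid(beta * margin)).

    Only the difference between the two scores matters, so no absolute
    reward value is ever needed. Illustrative only; not the paper's loss.
    """
    margin = beta * (score_chosen - score_rejected)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss is lower when the human's trajectory is scored higher than the robot's.
good_ordering = preference_loss(score_chosen=2.0, score_rejected=0.0)
bad_ordering = preference_loss(score_chosen=0.0, score_rejected=2.0)
```

Note that the function never sees a reward, only which of the two trajectories was preferred.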
The technical contribution here is something called RPRO, which adds a regularizer to prevent what the authors call "reward hacking." I should know this better, but my understanding is that without the regularizer acting as an anchor, the model can find degenerate solutions that technically satisfy the preference objective but don't actually produce useful behavior. The regularizer keeps the learned preferences grounded.
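Here is a toy illustration of the kind of degenerate solution I mean, plus one way an anchoring term could rule it out. To be clear, this is not RPRO: the `anchored_loss` function, the `ref_chosen` reference score, and the hinge-style penalty are all assumptions I made up for illustration, since I don't know the form of the paper's regularizer.

```python
import math


def pairwise_loss(chosen: float, rejected: float) -> float:
    """Plain pairwise preference loss; depends only on the score margin."""
    margin = chosen - rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def anchored_loss(chosen: float, rejected: float, ref_chosen: float, weight: float = 1.0) -> float:
    """Pairwise loss plus a hypothetical anchor on the chosen score.

    The penalty grows when the chosen trajectory's score falls below a
    reference value, even if the margin over the rejected one is preserved.
    """
    penalty = max(0.0, ref_chosen - chosen)
    return pairwise_loss(chosen, rejected) + weight * penalty


# Two toy (chosen, rejected) score pairs with the same margin of 2.0.
healthy = (-1.0, -3.0)       # chosen score stays near the reference
degenerate = (-10.0, -12.0)  # both scores collapsed, margin unchanged

same_without_anchor = pairwise_loss(*healthy) == pairwise_loss(*degenerate)
healthy_anchored = anchored_loss(*healthy, ref_chosen=-1.0)
degenerate_anchored = anchored_loss(*degenerate, ref_chosen=-1.0)
```

The unregularized loss cannot tell the two cases apart because it only sees the margin, while the anchored version charges the collapsed case for drifting away from the reference.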


