RoboSeek: You Need to Interact with Your Objects

Yibo Peng1,2,5*, Jiahao Yang1*, Shenhao Yan1,3, Ziyu Huang1,2, Shuang Li5, Shuguang Cui1,2,
Yiming Zhao1,4,5†, Yatong Han1,5†

1 FNii-Shenzhen     2 The Chinese University of Hong Kong, Shenzhen 3 Northeastern University
4 Harbin Engineering University     5 Infused Synapse AI

* Equal Contribution.    † Corresponding authors: yiming_zhao@hrbeu.edu.cn, hanyatong@cuhk.edu.cn

📄 Paper 📝 arXiv 💻 Code (to be released before Oct. 15)

Abstract

Optimizing and refining action execution through exploration and interaction is a promising direction for robotic manipulation. However, practical approaches to interaction-driven robotic learning remain underexplored, particularly for long-horizon tasks in which sequential decision-making, physical constraints, and perceptual uncertainty pose significant challenges. Motivated by embodied cognition theory, we propose RoboSeek, a framework for embodied action execution that leverages interactive experience to accomplish manipulation tasks. RoboSeek optimizes prior knowledge from high-level perception models through closed-loop training in simulation and achieves robust real-world execution via a real2sim2real transfer pipeline. Specifically, we first replicate real-world environments in simulation using 3D reconstruction, providing visually and physically consistent training scenes. We then train policies in simulation with reinforcement learning and the cross-entropy method, leveraging visual priors, and deploy the learned policies on real robotic platforms. RoboSeek is hardware-agnostic and is evaluated on multiple robotic platforms across eight long-horizon manipulation tasks involving sequential interactions, tool use, and object handling. It achieves an average success rate of 79%, significantly outperforming baselines whose success rates remain below 50%, highlighting its generalization and robustness across tasks and platforms. Experimental results validate the effectiveness of our training framework in complex, dynamic real-world settings and demonstrate the stability of the proposed real2sim2real transfer mechanism, paving the way for more generalizable embodied robotic learning.
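For readers who prefer code, here is a minimal, runnable sketch of how the three pipeline stages chain together. Every function below is a hypothetical stub written for illustration only; none of these names come from the (not yet released) RoboSeek codebase.

```python
# Hypothetical skeleton of the real2sim2real loop described in the abstract.
# Each stub stands in for a full component of the pipeline.

def reconstruct_scene(real_scans):
    """real2sim: 3D-reconstruct the real scene into a simulation environment."""
    return {"objects": real_scans}            # stub 'environment'

def train_in_sim(sim_env, visual_priors):
    """Closed-loop training: RL policy learning plus CEM refinement of priors."""
    return lambda obs: visual_priors          # stub 'policy'

def deploy_policy(policy, robot_obs):
    """sim2real: execute the learned policy on the physical platform."""
    return policy(robot_obs)

sim_env = reconstruct_scene(real_scans=["rgbd_frame_0"])
policy = train_in_sim(sim_env, visual_priors="grasp_region_prior")
action = deploy_policy(policy, robot_obs="camera_frame")
```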

Method Overview

RoboSeek Framework

Figure 2. Overview of the RoboSeek method.

Given visual priors from a high-level perception model, we construct a potential attention space. We then use reinforcement learning in simulation, with a transformer policy, to learn an embodied executor within this attention space, and adopt the cross-entropy method to refine the attention space itself. The resulting real2sim2real pipeline achieves robust, stable control on long-horizon, complex manipulation tasks and generalizes across diverse robotic platforms.
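As a concrete illustration of the refinement step, the sketch below shows a generic cross-entropy-method (CEM) loop, assuming the attention region is parameterized as a Gaussian over 3D target points and scored by simulated rollouts. `evaluate_in_sim`, the goal point, and all hyperparameters are hypothetical placeholders, not RoboSeek's actual implementation.

```python
import numpy as np

def evaluate_in_sim(point):
    """Hypothetical rollout score: here, just distance to a made-up goal."""
    goal = np.array([0.4, 0.0, 0.2])
    return -np.linalg.norm(point - goal)

def cem_refine(mu, sigma, iters=10, pop=64, elite_frac=0.1):
    """Shift a Gaussian sampling distribution toward high-reward samples."""
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = np.random.normal(mu, sigma, size=(pop, mu.shape[0]))
        rewards = np.array([evaluate_in_sim(s) for s in samples])
        elites = samples[np.argsort(rewards)[-n_elite:]]   # keep top-k
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu, sigma

# Start from a coarse prior, e.g. derived from a perception model's output.
mu, sigma = cem_refine(np.array([0.5, 0.1, 0.3]), np.full(3, 0.15))
```

The essential CEM step is the re-fit: each iteration keeps only the top-scoring samples and refits the Gaussian to them, progressively concentrating the attention space around high-reward regions.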

Results

RoboSeek achieves an average success rate of 79%, significantly outperforming baselines whose success rates remain below 50%. These results show that RoboSeek handles long-horizon, complex manipulation tasks with high stability and robustness, validating the effectiveness of our training framework.

Result 1

Success rates for different tasks using RoboSeek and baselines.

Result 2

Success rates of our method on real-robot domestic tasks. Each task is run for 20 trials.

Real-World Demonstrations

Task 1

Task 2

Task 3

Task 4

Task 5

Task 6

Task 7

Task 8