Perception2Cognition, Physical Intelligence Lab

We study multimodal intelligence that interacts with simulated and physical worlds through language, cognition, and action. Our core research areas are Multimodal Learning & Reasoning (Multimodal Perception / Commonsense Reasoning), Human–Robot Interaction (HRI), World Models, and AI Safety & Responsibility (Alignment & Safety). We also emphasize AI for Science & Education as a catalyst for interdisciplinary impact, building AI systems that contribute to scientific discovery and real-world problem solving. Our long-term goal is to create AI that understands and collaborates with humans in a way that approaches human intelligence.


1) Multimodal Perception & Commonsense Reasoning

We study how to make multimodal generative models and multimodal LLMs both lighter and more performant, and we train a single model to learn heterogeneous modalities—natural language, vision, audio, documents, actions, and UI—while improving reasoning quality in interactive settings.

  1. Are Any-to-Any Models More Consistent Across Modality Transfers Than Specialists? (ACL 2025)
  2. Don’t Look Only Once: Multimodal Interactive Reasoning with Selective Visual Revisitation (2025)
  3. Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation (EMNLP 2025)
  4. MASS: Overcoming Language Bias in Image-Text Matching (AAAI 2025)
  5. V.I.P.: Iterative Online Preference Distillation for Efficient Video Diffusion Models (ICCV 2025)
  6. Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you! (EMNLP 2024)
  7. Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding (EMNLP 2024)


2) Embodied AI, HRI, World Models

We train agents that tie together instruction understanding, perception, and action. We address safe action planning for Omni LLMs and Vision–Language–Action models, language-instruction–based navigation, and egocentric interaction. We also model personality, affect, and nonverbal cues to build human-friendly agents that can communicate naturally with people. By combining reinforcement learning with insights from psychology, we aim to instill stable behavioral traits in these agents.

  1. VisEscape: Exploration-driven Decision-making in Virtual Escape Rooms (EMNLP 2025)
  2. CANVAS: Commonsense-Aware Navigation System for Intuitive HRI (ICRA 2025)
  3. EgoSpeak: Learning When to Speak for Egocentric Conversational Agents (NAACL 2025)
  4. GuideDog: Egocentric Multimodal Dataset for Accessibility-Aware Guidance (2025)
  5. Persona Dynamics: Personality Traits in Text-Based Agents (ACL 2025)
  6. TRAIT: Psychometrics-grounded LLM Personality Testset (NAACL 2025)
  7. DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding (ICCV 2025)


3) Safety, Alignment & Responsible AI

We build models that remain robust to perturbations in multimodal inputs such as language, speech, and vision, and we study interpretability, verifier reliability, watermarking, and other methods for using AI safely.

  1. Representation Bending for LLM Safety (ACL 2025)
  2. G1yphD3c0de: Safer LMs on Visually Perturbed Texts (COLM 2025)
  3. Verifying the Verifiers: Pitfalls & Potentials in Fact Verifiers (COLM 2025)
  4. KL Penalty Control via Perturbation for DPO (2025)
  5. Reading Books is Great, But Not if You Are Driving! Visually Grounded Reasoning about Defeasible Commonsense Norms
  6. Subtle Risks, Critical Failures: Diagnosing Physical Safety for Embodied LLMs (EMNLP 2025)


4) AI for Science & Education

We study scientific-assistant AI agents that support research and education by forming hypotheses, running experiments, verifying evidence, and acquiring new scientific knowledge. We aim for research that drives innovation through interdisciplinary collaboration across diverse fields.

  1. When AI Co-Scientists Fail: SPOT, a Benchmark for Automated Verification of Scientific Research (2025)
  2. C²: Scalable Auto-Feedback for LLM-based Chart Generation (NAACL 2025)
  3. Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation (2025)
  4. Scalp Diagnostic System With Label-Free Segmentation and Training-Free Image Translation (MICCAI 2025)