UNC Chapel Hill
The paradigm of training large-scale foundation models has driven significant advances in multimodal AI. However, pursuing further performance gains solely through model scaling is becoming impractical due to rising computational costs and resource limitations. Moreover, the reasoning and generation processes of these models remain largely uninterpretable and uncontrollable, often leading to unfaithful outputs. In this talk, I will discuss my efforts to make multimodal generative models more controllable and trustworthy without increasing their size. First, I will introduce faithful reasoning frameworks, in which the multimodal generation process mirrors how humans reason about and create content such as images and videos. Concretely, in these frameworks, models create a detailed plan that decomposes a complex generation task into simpler steps and retrieve relevant information from multimodal knowledge bases before generating the final outputs. Next, I will describe fine-grained evaluation methods that assess model capabilities across multiple dimensions, such as object counting and spatial relation understanding, thereby providing a detailed picture of each model's strengths and weaknesses. In turn, these evaluations enable targeted model improvements that address the identified weaknesses through test-time guidance or by updating training environments. Together, these directions offer a pathway toward more intelligent, reliable, and efficient multimodal AI models.
Zoom link: snu-ac-kr.zoom.us/j/9801363839?omn=88188091043
Jaemin Cho is a PhD candidate in the Department of Computer Science at UNC Chapel Hill and an incoming Assistant Professor in the Department of Computer Science at Johns Hopkins University. His research focuses on improving reasoning capabilities in multimodal generation. His work has been featured at top conferences in computer vision (CVPR, ICCV, ECCV), natural language processing (EMNLP, NAACL, COLM), and machine learning (NeurIPS, ICML, ICLR, AAAI), and has been recognized through multiple oral presentations at NeurIPS and ICLR, the Bloomberg Data Science PhD Fellowship, and media coverage in MIT Technology Review, IEEE Spectrum, and WIRED. He has also co-organized the T4V: Transformers for Vision workshop at CVPR 2023, 2024, and 2025.