UNC Chapel Hill
The paradigm of training large-scale foundation models has driven significant advances in multimodal AI. However, pursuing further performance gains solely through model scaling is becoming impractical due to rising computational costs and resource limitations. Moreover, the reasoning and generation processes of these models remain largely uninterpretable and uncontrollable, often leading to unfaithful outputs. In this talk, I will discuss my efforts to make multimodal generative models more controllable and trustworthy without increasing their size. First, I will introduce faithful reasoning frameworks, in which the multimodal generation process mirrors how humans reason about and create content such as images and videos. Concretely, in these frameworks, models create a detailed plan that decomposes a complex generation task into simpler steps and retrieve relevant information from multimodal knowledge bases before generating the final outputs. Next, I will describe fine-grained evaluation methods that assess model capabilities across multiple dimensions, such as object counting and spatial relation understanding, thereby providing a detailed picture of each model's strengths and weaknesses. In turn, these evaluations enable targeted model improvements that address the identified weaknesses through test-time guidance or by updating training environments. Together, these directions offer a pathway toward more intelligent, reliable, and efficient multimodal AI models.
Zoom link: snu-ac-kr.zoom.us/j/9801363839?omn=88188091043
Jaemin Cho is a PhD candidate in the Department of Computer Science at UNC Chapel Hill and an incoming Assistant Professor in the Department of Computer Science at Johns Hopkins University. His research focuses on improving reasoning capabilities in multimodal generation. His work has been featured at top conferences in computer vision (CVPR, ICCV, ECCV), natural language processing (EMNLP, NAACL, COLM), and machine learning (NeurIPS, ICML, ICLR, AAAI), and has been recognized through multiple oral presentations at NeurIPS and ICLR, the Bloomberg Data Science PhD Fellowship, and media coverage in MIT Technology Review, IEEE Spectrum, and WIRED. He has also co-organized the T4V: Transformers for Vision workshop at CVPR 2023, 2024, and 2025.