Name: Seungwhan Moon
Affiliation: Meta Reality Labs
Host: Prof. Gunhee Kim (김건희)
Date: December 14, 2023, 12:30 PM - 1:30 PM
Location: Bldg. 302, Room 209
Large Language Models (LLMs) have significantly enhanced the capacity of machines to understand and articulate human language. Progress in LLMs has also driven notable advances in the vision-language domain, bridging the gap between image encoders and LLMs to combine their reasoning capabilities. However, most previous work has focused on relatively small-scale models and a single bi-modal pair (e.g., text and images), due to scaling challenges in model parameters and limited data availability.
In this talk, I will describe our recent work on the Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e., text, image, video, audio, motion sensor) and generates textual responses. I will highlight several techniques we applied to efficiently reduce the training load while achieving state-of-the-art zero-shot multimodal reasoning capabilities.
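To make the "bridging modality encoders to an LLM" idea in the abstract concrete, here is a minimal sketch of one common design: a lightweight trainable adapter that projects a frozen modality encoder's features into the LLM's token-embedding space so they can be prepended to the text tokens. The class name ModalityAdapter, the layer sizes, and the number of soft tokens are illustrative assumptions, not details taken from the talk.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Projects pooled features from a frozen modality encoder (image, audio,
    sensor, ...) into the LLM's token-embedding space as a fixed number of
    soft prompt tokens. Illustrative sketch; sizes and names are assumptions."""

    def __init__(self, enc_dim: int, llm_dim: int, num_tokens: int = 32):
        super().__init__()
        self.num_tokens = num_tokens
        # The projection is the only trainable part in this sketch;
        # the modality encoder and the LLM itself stay frozen.
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim * num_tokens),
        )

    def forward(self, enc_feat: torch.Tensor) -> torch.Tensor:
        # enc_feat: (batch, enc_dim) pooled encoder features
        batch = enc_feat.size(0)
        tokens = self.proj(enc_feat).view(batch, self.num_tokens, -1)
        return tokens  # (batch, num_tokens, llm_dim)


# Usage: prepend the projected modality tokens to the text token embeddings
# before running the (frozen or lightly tuned) LLM decoder.
adapter = ModalityAdapter(enc_dim=1024, llm_dim=4096)
image_feat = torch.randn(2, 1024)        # stand-in for image-encoder features
soft_tokens = adapter(image_feat)        # (2, 32, 4096)
text_embeds = torch.randn(2, 16, 4096)   # stand-in for text token embeddings
llm_inputs = torch.cat([soft_tokens, text_embeds], dim=1)
```

Because only the small adapter is trained while the large encoder and LLM weights remain frozen, this style of design keeps the training load modest, which is the general efficiency theme the abstract refers to.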
Seungwhan Moon is a Lead Research Scientist at Meta Reality Labs, conducting research in multimodal learning for AR/VR applications. His recent projects have focused on cutting-edge multimodal and knowledge-grounded conversational AI. He received his PhD in Language Technologies at the School of Computer Science, Carnegie Mellon University, under Prof. Jaime Carbonell. Before joining Meta, he also worked at several research institutions, including Snapchat Research and Samsung Research.