Chuang Gan

Principal Research Staff Member - MIT-IBM Watson AI Lab

Invited Talk: Multimodal Intelligence

Abstract:

Existing research on video understanding has focused primarily on the visual modalities of videos (e.g., imagery and/or depth sensors), whereas video data is inherently multimodal (e.g., including metadata, sound, and text). The utility of multimodal integration in humans is well appreciated: from an early age, infants begin to recognize, interact with, and understand the physical world through the multimodal integration of static imagery, motion, sound, and language. In this talk, I will first show how visual and audio cues can be used jointly for sound separation, sound source localization, and music generation. I will then demonstrate how language and concepts can help machine learning models reason about visual scenes, physics, and causal relationships.

Biography:

Chuang Gan is a principal research staff member at the MIT-IBM Watson AI Lab. He is also a visiting research scientist at MIT, working closely with Prof. Antonio Torralba and Prof. Josh Tenenbaum. Before that, he completed his Ph.D. with the highest honors at Tsinghua University, supervised by Prof. Andrew Chi-Chih Yao. His research focuses primarily on video understanding, including representation learning, neural-symbolic visual reasoning, audio-visual scene analysis, and skill learning. His work has been recognized with a Microsoft Fellowship, a Baidu Fellowship, and media coverage from CNN, BBC, The New York Times, WIRED, Forbes, and MIT Technology Review.