1st MIAI Distinguished lecture on September 30th from 4:30 to 5:30 p.m.

September 30, 2021, from 4:30 to 5:30 p.m.
We are pleased to announce the first seminar in this series, to be held on September 30th: a distinguished lecture by Prof. Kristen Grauman (UT Austin / Facebook AI Research).

Sights, sounds, and space:
Audio-visual learning in 3D environments

Kristen Grauman is a Professor in the Department of Computer Science at the University of Texas at Austin and a Research Director at Facebook AI Research (FAIR). Her research in computer vision and machine learning focuses on visual recognition, video, and embodied perception. Before joining UT Austin in 2007, she received her Ph.D. from MIT. She is an IEEE Fellow, AAAI Fellow, Sloan Fellow, and recipient of the 2013 Computers and Thought Award. She and her collaborators have been recognized with several Best Paper awards in computer vision, including a 2011 Marr Prize and a 2017 Helmholtz Prize (test of time award). She served as an Associate Editor-in-Chief for PAMI and as a Program Chair of CVPR 2015 and NeurIPS 2018.

ABSTRACT

Moving around in the world is naturally a multisensory experience, but today’s embodied agents are deaf—restricted to solely their visual perception of the environment. We explore audio-visual learning in complex, acoustically and visually realistic 3D environments. By both seeing and hearing, the agent must learn to navigate to a sounding object, use echolocation to anticipate its 3D surroundings, and discover the link between its visual inputs and spatial sound.

To support this goal, we introduce SoundSpaces: a platform for audio rendering based on geometrical acoustic simulations for two sets of publicly available 3D environments (Matterport3D and Replica). SoundSpaces makes it possible to insert arbitrary sound sources in an array of real-world scanned environments. Building on this platform, we pursue a series of audio-visual spatial learning tasks. Specifically, in audio-visual navigation, the agent is tasked with traveling to a sounding target in an unfamiliar environment (e.g., go to the ringing phone). In audio-visual floorplan reconstruction, a short video with audio is converted into a house-wide map, where audio allows the system to “see” behind the camera and behind walls. For self-supervised feature learning, we explore how echoes observed in training can enrich an RGB encoder for downstream spatial tasks including monocular depth estimation. Our results suggest how audio can benefit visual understanding of 3D spaces, and our research lays groundwork for new research in embodied AI with audio-visual perception.


Published on September 22, 2021