Vision and Language for Autonomous Driving and Robotics

CVPR 2024 Workshop, Seattle WA, USA


Vision and language processing are increasingly intertwined, especially in autonomous driving and robotics. The need for this pairing is apparent in the dynamics of real-world environments. An autonomous vehicle operating in an urban area, for instance, should not rely on its visual sensors alone for pedestrian detection; it must also interpret and act on auditory signals such as spoken warnings. Similarly, robots that integrate visual data with linguistic context promise more adaptive behavior, particularly in diverse settings. This workshop is expected to spotlight data-centric autonomous driving, with an emphasis on vision-based techniques. Central topics include vision and language for autonomous driving, language-driven perception, and simulation. We will also cover vision and language representation learning and the future of multimodal motion prediction and planning in robotics. Given the rapid expansion of the field, new datasets and metrics for multimodal learning are on the agenda, as are privacy concerns associated with multimodal data. We place a firm emphasis on safety: systems must correctly interpret and act on both visual and linguistic inputs to prevent mishaps in real-world scenarios. Through these topics, the workshop seeks to foster a deeper academic understanding of the intersection of vision and language in autonomous systems.
By convening experts from across disciplines, we aim to examine current state-of-the-art methods, address open challenges, and chart directions for future work, so that the findings are useful to both academic and industrial communities.

Call for Papers

The CVPR 2024 Vision and Language for Autonomous Driving and Robotics Workshop (https://vision-language-adr.github.io) is expected to center around data-centric autonomous driving, with a particular focus on vision-based methods.

This workshop is intended to:

  • Explore areas in robotics where vision and language could help
  • Encourage communication and collaboration on vision and language for autonomous agents
  • Provide an opportunity for the CVPR community to discuss this exciting and growing area of multimodal representations

We welcome paper submissions on all topics related to vision and language for autonomous driving and robotics, including but not limited to:

  • Vision and language for autonomous driving
  • Language-driven perception
  • Language-driven sensor and traffic simulation
  • Vision and language representation learning
  • Multimodal motion prediction and planning for robotics
  • New datasets and metrics for multimodal learning
  • Safety: Ensuring that systems can correctly interpret and act upon visual and linguistic inputs in real-world situations to prevent accidents
  • Language agents for robotics
  • Language-based scene understanding for driving scenarios
  • Multi-modal fusion for end-to-end autonomous driving
  • Large language models (LLMs) as task planners
  • Other applications of LLMs to driving and robotics

Style and Author Instructions

  • Paper Length: We ask authors to use the official CVPR 2024 template and limit submissions to 4-8 pages, excluding references.
  • Dual Submissions: The workshop is non-archival. In addition, in light of the new single-track policy of CVPR 2024, we strongly encourage papers accepted to CVPR 2024 to present at our workshop.
  • Presentation Forms: All accepted papers will get poster presentations during the workshop; selected papers will get oral presentations.

All submissions should be anonymized. Papers with more than 4 pages (excluding references) will be reviewed as long papers, and papers with more than 8 pages (excluding references) will be rejected without review. Supplementary material is optional; supported formats are pdf, mp4, and zip. All papers that were not previously presented at a major conference will be peer-reviewed by three experts in the field in a double-blind manner. If you are submitting a previously accepted conference paper, please also attach a copy of the acceptance notification email to the supplementary material.

All submissions should adhere to the CVPR 2024 author guidelines.

Contact: If you have any questions, please contact vladr@googlegroups.com.

Submission Portal: https://openreview.net/group?id=thecvf.com/CVPR/2024/Workshop/VLADR

Paper Review Timeline:

Paper Submission and supplemental material deadline March 29th, 2024 (PST)
Notification to authors April 21st, 2024 (PST) (originally April 14th)
Camera ready deadline May 4th, 2024 (PST) (originally April 28th)

Invited Speakers

Jitendra Malik

Professor at UC Berkeley

Jitendra Malik received the B.Tech degree in Electrical Engineering from the Indian Institute of Technology, Kanpur in 1980 and the PhD degree in Computer Science from Stanford University in 1985. In January 1986, he joined the University of California at Berkeley, where he is currently the Arthur J. Chick Professor in the Department of Electrical Engineering and Computer Sciences. He is also on the faculty of the Department of Bioengineering, and the Cognitive Science and Vision Science groups. During 2002-2004 he served as the Chair of the Computer Science Division, and as the Department Chair of EECS during 2004-2006 as well as 2016-2017. In 2018 and 2019, he served as Research Director and Site Lead of Facebook AI Research in Menlo Park. Prof. Malik's research group has worked on many different topics in computer vision, computational modeling of human vision, computer graphics and the analysis of biological images. Several well-known concepts and algorithms arose in this research, such as anisotropic diffusion, normalized cuts, high dynamic range imaging, shape contexts and R-CNN. He has mentored more than 70 PhD students and postdoctoral fellows.

Trevor Darrell

Professor at UC Berkeley

Prof. Darrell is on the faculty of the CS and EE Divisions of the EECS Department at UC Berkeley. He founded and co-leads the Berkeley Artificial Intelligence Research (BAIR) lab, the Berkeley DeepDrive (BDD) Industrial Consortia, and the recently launched BAIR Commons program in partnership with Facebook, Google, Microsoft, Amazon, and other partners. He also was Faculty Director of the PATH research center at UC Berkeley, and led the Vision group at the UC-affiliated International Computer Science Institute in Berkeley from 2008-2014. Prior to that, Prof. Darrell was on the faculty of the MIT EECS department from 1999-2008, where he directed the Vision Interface Group. He was a member of the research staff at Interval Research Corporation from 1996-1999, and received the S.M. and Ph.D. degrees from MIT in 1992 and 1996, respectively. He obtained the B.S.E. degree from the University of Pennsylvania in 1988.

Dragomir Anguelov

Distinguished Researcher at Waymo

Drago Anguelov is a Principal Scientist at Waymo, developing and applying machine learning methods for autonomous vehicle perception and, more generally, in computer vision and robotics. He is an expert in machine learning, with applications to computer vision, robotics, and computer graphics.

Chelsea Finn

Assistant Professor at Stanford University

Chelsea Finn is an Assistant Professor in Computer Science and Electrical Engineering at Stanford University. Chelsea is interested in the capability of robots and other agents to develop broadly intelligent behavior through learning and interaction. Chelsea also spent time at Google as a part of the Google Brain team.

Fei Xia

Research Scientist at Google DeepMind Robotics

Fei Xia is a Research Scientist at Google Research where he works on the Robotics team. He received his PhD degree from the Department of Electrical Engineering, Stanford University. He was co-advised by Silvio Savarese in SVL and Leonidas Guibas. His mission is to build intelligent embodied agents that can interact with complex and unstructured real-world environments, with applications to home robotics. He has been approaching this problem from three aspects: 1) Large-scale and transferable simulation for robotics. 2) Learning algorithms for long-horizon tasks. 3) Combining geometric and semantic representations of environments. Most recently, he has been exploring the use of foundation models for robot decision-making.

Jamie Shotton

Chief Scientist at Wayve

As head of Wayve's Discovery department, Jamie leads three distinct areas: the Scaled Intelligence team, which is building Wayve's core foundation models for autonomous driving; the Science group, where he's guiding his research teams to invest in new ideas that have the potential to become game-changing technological advances for the company; and the Simulation group, where he's driving the development of simulation tools and technologies critical to unlocking safe and adaptable autonomous driving through off-road measurement. Jamie has been at the forefront of applied AI research for the past 20 years. Before joining Wayve, Jamie was Partner Director of Science at Microsoft and Head of the Mixed Reality & AI Labs. While at Microsoft, Jamie shipped foundational features for Microsoft's Kinect (Microsoft's line of motion sensing input devices) and the hand- and eye-tracking that enable HoloLens 2's interaction model (smart glasses). Jamie has a PhD in computer vision from the University of Cambridge and has received multiple Best Paper and Best Demo Awards at top-tier academic conferences. He was elected a Fellow of the Royal Academy of Engineering in 2021.


Schedule

Opening remarks and welcome 09:00 AM - 09:15 AM
Keynote talk (TBD) 09:15 AM - 09:45 AM
Keynote talk (TBD) 10:15 AM - 10:45 AM
Keynote talk (TBD) 11:00 AM - 11:30 AM
Poster Session & Lunch 11:30 AM - 01:30 PM
Keynote talk (TBD) 01:30 PM - 02:00 PM
Keynote talk (TBD) 02:15 PM - 02:45 PM
Keynote talk (TBD) 03:00 PM - 03:40 PM
Keynote talk (TBD) 04:15 PM - 05:00 PM