VLADR:
Vision and Language for Autonomous Driving and Robotics

CVPR 2024 Workshop, Seattle WA, USA
Jun 18th (Tuesday), 2024

Introduction

Vision and language processing are becoming increasingly intertwined, especially in autonomous driving and robotics. The need for this combination is clear in the multifaceted dynamics of real-world environments: an autonomous vehicle navigating an urban street should not rely solely on its visual sensors for pedestrian detection, but must also interpret and act upon auditory signals such as vocalized warnings. Likewise, robots that combine visual data with linguistic context promise more adaptive behavior across diverse settings.

This workshop spotlights data-centric autonomous driving, with an emphasis on vision-based techniques. Central to our discussions will be vision and language for autonomous driving, language-driven perception, and simulation. We will examine vision and language representation learning and explore the future of multimodal motion prediction and planning in robotics. Recognizing the rapid growth of this field, the agenda also covers new datasets and metrics for multimodal learning, as well as privacy concerns associated with multimodal data. Safety remains a central theme: systems must correctly interpret and act on both visual and linguistic inputs to prevent mishaps in real-world scenarios.

Through a comprehensive examination of these topics, the workshop seeks to foster a deeper understanding of the intersection between vision and language in autonomous systems. By convening experts from interdisciplinary fields, our objective is to survey current state-of-the-art methods, address open challenges, and chart directions for future work, ensuring the findings resonate with both academic and industrial communities.



Call for Papers

The CVPR 2024 Vision and Language for Autonomous Driving and Robotics Workshop (https://vision-language-adr.github.io) centers on data-centric autonomous driving, with a particular focus on vision-based methods.

This workshop is intended to:

  • Explore areas of robotics where vision and language could help
  • Encourage communication and collaboration on vision and language for autonomous agents
  • Provide an opportunity for the CVPR community to discuss this exciting and growing area of multimodal representations

We welcome paper submissions on all topics related to vision and language for autonomous driving and robotics, including but not limited to:

  • Vision and language for autonomous driving
  • Language-driven perception
  • Language-driven sensor and traffic simulation
  • Vision and language representation learning
  • Multimodal motion prediction and planning for robotics
  • New datasets and metrics for multimodal learning
  • Safety: Ensuring that systems can correctly interpret and act upon visual and linguistic inputs in real-world situations to prevent accidents
  • Language agents for robotics
  • Language-based scene understanding for driving scenarios
  • Multi-modal fusion for end-to-end autonomous driving
  • Large Language Models (LLMs) as task planners
  • Other applications of LLMs to driving and robotics

Style and Author Instructions

  • Paper Length: We ask authors to use the official CVPR 2024 template and limit submissions to 4-8 pages excluding references.
  • Dual Submissions: The workshop is non-archival. In addition, in light of the new single-track policy of CVPR 2024, we strongly encourage authors of papers accepted to CVPR 2024 to also present them at our workshop.
  • Presentation Forms: All accepted papers will be presented as posters during the workshop; selected papers will also be given oral presentations.

All submissions should be anonymized. Papers longer than 4 pages (excluding references) will be reviewed as long papers, and papers longer than 8 pages (excluding references) will be rejected without review. Supplementary material is optional; supported formats are pdf, mp4, and zip. All papers that were not previously presented at a major conference will be peer-reviewed by three experts in the field in a double-blind manner. If you are submitting a previously accepted conference paper, please also attach a copy of the acceptance notification email to the supplementary material.

All submissions should adhere to the CVPR 2024 author guidelines.

Contact: If you have any questions, please contact vladr@googlegroups.com.

Submission Portal: https://openreview.net/group?id=thecvf.com/CVPR/2024/Workshop/VLADR

Paper Review Timeline:

Paper submission and supplementary material deadline: March 29th, 2024 (PST)
Notification to authors: April 21st, 2024 (PST) (extended from April 14th)
Camera-ready deadline: May 4th, 2024 (PST) (extended from April 28th)



Invited Speakers

Jitendra Malik

Professor at UC Berkeley

Jitendra Malik received the B.Tech degree in Electrical Engineering from the Indian Institute of Technology, Kanpur in 1980 and the PhD degree in Computer Science from Stanford University in 1985. In January 1986, he joined the University of California at Berkeley, where he is currently the Arthur J. Chick Professor in the Department of Electrical Engineering and Computer Sciences. He is also on the faculty of the Department of Bioengineering, and the Cognitive Science and Vision Science groups. During 2002-2004 he served as the Chair of the Computer Science Division, and as the Department Chair of EECS during 2004-2006 as well as 2016-2017. In 2018 and 2019, he served as Research Director and Site Lead of Facebook AI Research in Menlo Park. Prof. Malik's research group has worked on many different topics in computer vision, computational modeling of human vision, computer graphics, and the analysis of biological images. Several well-known concepts and algorithms arose in this research, such as anisotropic diffusion, normalized cuts, high dynamic range imaging, shape contexts, and R-CNN. He has mentored more than 70 PhD students and postdoctoral fellows.

Trevor Darrell

Professor at UC Berkeley

Prof. Darrell is on the faculty of the CS and EE Divisions of the EECS Department at UC Berkeley. He founded and co-leads the Berkeley Artificial Intelligence Research (BAIR) lab, the Berkeley DeepDrive (BDD) Industrial Consortia, and the recently launched BAIR Commons program in partnership with Facebook, Google, Microsoft, Amazon, and other partners. He also was Faculty Director of the PATH research center at UC Berkeley, and led the Vision group at the UC-affiliated International Computer Science Institute in Berkeley from 2008-2014. Prior to that, Prof. Darrell was on the faculty of the MIT EECS department from 1999-2008, where he directed the Vision Interface Group. He was a member of the research staff at Interval Research Corporation from 1996-1999, and received the S.M. and Ph.D. degrees from MIT in 1992 and 1996, respectively. He obtained the B.S.E. degree from the University of Pennsylvania in 1988.

Chelsea Finn

Assistant Professor at Stanford University

Chelsea Finn is an Assistant Professor in Computer Science and Electrical Engineering at Stanford University. Chelsea is interested in the capability of robots and other agents to develop broadly intelligent behavior through learning and interaction. Chelsea also spent time at Google as a part of the Google Brain team.

Fei Xia

Research Scientist at Google DeepMind Robotics

Fei Xia is a Research Scientist at Google Research, where he works on the Robotics team. He received his PhD degree from the Department of Electrical Engineering, Stanford University, where he was co-advised by Silvio Savarese in SVL and Leonidas Guibas. His mission is to build intelligent embodied agents that can interact with complex and unstructured real-world environments, with applications to home robotics. He has been approaching this problem from three aspects: 1) large-scale and transferable simulation for robotics, 2) learning algorithms for long-horizon tasks, and 3) combining geometric and semantic representations of environments. Most recently, he has been exploring the use of foundation models for robot decision-making.



Long Chen

Staff Scientist at Wayve

Long Chen is a Staff Scientist at Wayve, focusing on building Vision Language Action Models (VLAM) for the next wave of autonomous driving, including groundbreaking work on Driving with LLMs and LINGO. Previously, he was a research engineer at Lyft Level 5, where he led the development of data-driven planning models using crowd-sourced data for Lyft's self-driving cars. Long received a PhD from Bournemouth University and a master’s degree from University College London, where his research focused on applying AI in various domains such as mixed reality, surgical robotics, and healthcare.



Tentative Schedule

Opening remarks and welcome 08:55 AM - 09:00 AM
Chelsea Finn 09:00 AM - 09:45 AM
Trevor Darrell 10:00 AM - 10:45 AM
Jitendra Malik 11:05 AM - 11:50 AM
Poster Session & Lunch 12:00 PM - 02:00 PM
Fei Xia 02:00 PM - 02:45 PM
Long Chen 03:00 PM - 03:45 PM
Oral Session 04:00 PM - 05:00 PM

Accepted Papers


[Oral] RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration for Robotic Manipulation

Hanxiao Jiang, Binghao Huang, Ruihai Wu, Zhuoran Li, Shubham Garg, Hooshang Nayyeri, Shenlong Wang, Yunzhu Li

[OpenReview]  

On the Safety Concerns of Deploying LLMs/VLMs in Robotics: Highlighting the Risks and Vulnerabilities

Xiyang Wu, Ruiqi Xian, Tianrui Guan, Jing Liang, Souradip Chakraborty, Fuxiao Liu, Brian M. Sadler, Dinesh Manocha, Amrit Bedi

[OpenReview]  

Optimizing Visual Question Answering Models for Driving: Bridging the Gap Between Human and Machine Attention Patterns

Kaavya Rekanar, Martin Hayes, Ganesh Sistu, Ciaran Eising

[OpenReview]  

[Oral] Collision Avoidance Metric for 3D Camera Evaluation

Vage Taamazyan, Alberto Dall'Olio, Agastya Kalra

[OpenReview]  

Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving

Akshay Gopalkrishnan, Ross Greer, Mohan Trivedi

[OpenReview]  

[Oral] DriveLM: Driving with Graph Visual Question Answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, Hongyang Li

[OpenReview]  

[Oral] AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving

Mingfu Liang, Jong-Chyi Su, Samuel Schulter, Sparsh Garg, Shiyu Zhao, Ying Wu, Manmohan Chandraker

[OpenReview]  

Ambiguous Annotations: When is a Pedestrian not a Pedestrian?

Luisa Schwirten, Jannes Scholz, Daniel Kondermann, Janis Keuper

[OpenReview]  

Envisioning the Unseen: Revolutionizing Indoor Spaces with Deep Learning-Enhanced 3D Semantic Segmentation

Muhammad Arif

[OpenReview]  

Explanation for Trajectory Planning using Multi-modal Large Language Model for Autonomous Driving

Muhammad Arif

[OpenReview]  

Safedrive Dreamer: Navigating Safety-Critical Scenarios in the Real-world with World Models

Bangan Wang, Haitao Li, Tianyu Shi

[OpenReview]  

Improving End-To-End Autonomous Driving with Synthetic Data from Latent Diffusion Models

Harsh Goel, Sai Shankar Narasimhan

[OpenReview]  

ATLAS: Adaptive Landmark Acquisition using LLM-Guided Navigation

Utteja Kallakuri, Bharat Prakash, Arnab Neelim Mazumder, Hasib-Al Rashid, Nicholas R Waytowich, Tinoosh Mohsenin

[OpenReview]  

DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experiences

Yidong Huang, Jacob Sansom, Ziqiao Ma, Felix Gervits, Joyce Chai

[OpenReview]  

Driver Activity Classification Using Generalizable Representations from Vision-Language Models

Ross Greer, Mathias Viborg Andersen, Andreas Møgelmose, Mohan Trivedi

[OpenReview]  

Language-Driven Active Learning for Diverse Open-Set 3D Object Detection

Ross Greer, Bjørk Antoniussen, Andreas Møgelmose, Mohan Trivedi

[OpenReview]  

Evolutionary Reward Design and Optimization with Multimodal Large Language Models

Ali Emre Narin

[OpenReview]  

[Oral] Open6DOR: Benchmarking Open-instruction 6-DoF Object Rearrangement and A VLM-based Approach

Yufei Ding, Haoran Geng, Chaoyi Xu, Xiaomeng Fang, Jiazhao Zhang, Songlin Wei, Qiyu Dai, Zhizheng Zhang, He Wang

[OpenReview]  

