Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection
Date
2025-08-22
Authors
Advisor
Czarnecki, Krzysztof
Publisher
University of Waterloo
Abstract
3D object detection is a fundamental task in the autonomous driving perception pipeline, where identifying and localizing objects in the surrounding environment is critical for safe and robust decision-making. However, traditional 3D object detectors rely on a closed set of training categories, so they cannot recognize novel or out-of-distribution objects encountered in open-world driving scenarios. To address this limitation, the field of open-vocabulary 3D (OV-3D) object detection has emerged. It aims to generalize beyond predefined label sets by using vision-language models (VLMs) to align 3D object proposals with semantically rich, language-informed 2D features.
Despite promising results, a major challenge in OV-3D object detection lies in achieving robust cross-modal alignment between 3D and 2D features, which is often compromised by noisy annotations, occlusions, and resolution inconsistencies that disrupt semantic coherence. In this thesis, we present OV-SCAN, a novel framework for Open-Vocabulary 3D object detection that enforces Semantically Consistent Alignment for Novel object discovery. OV-SCAN introduces a two-stage strategy: (1) discovering precise 3D annotations for novel objects using vision-language supervision, and (2) filtering out semantically inconsistent or low-quality 3D–2D training pairs that arise from annotation errors and sensor limitations.
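As a rough illustration of what stage (2) could look like, the sketch below drops 3D–2D pairs whose 2D crop embedding disagrees with the text embedding of the assigned pseudo-label. This is not the filtering procedure used by OV-SCAN, which the abstract does not specify; the cosine-similarity test, the threshold value, and the names `Pair`, `cosine`, and `filter_consistent_pairs` are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Pair:
    """Hypothetical 3D-2D training pair (illustrative, not from the thesis)."""
    box_id: int            # identifier of the 3D box proposal
    crop_embedding: list   # embedding of the matching 2D image crop
    pseudo_label: str      # class name assigned during novel object discovery


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_consistent_pairs(pairs, text_embeddings, threshold=0.5):
    """Keep pairs whose crop embedding best matches its own pseudo-label.

    Assumed rule: a pair is kept only if the most similar text embedding
    belongs to the pair's pseudo-label and that similarity reaches `threshold`.
    """
    kept = []
    for pair in pairs:
        scores = {
            label: cosine(pair.crop_embedding, emb)
            for label, emb in text_embeddings.items()
        }
        best_label = max(scores, key=scores.get)
        if best_label == pair.pseudo_label and scores[best_label] >= threshold:
            kept.append(pair)
    return kept


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real VLM features.
    text_embeddings = {"car": [1.0, 0.0], "stroller": [0.0, 1.0]}
    pairs = [
        Pair(0, [0.9, 0.1], "car"),       # agrees with its label
        Pair(1, [0.6, 0.8], "car"),       # looks more like "stroller"
        Pair(2, [0.0, 0.0], "stroller"),  # degenerate crop, e.g. fully occluded
    ]
    kept = filter_consistent_pairs(pairs, text_embeddings)
    print([p.box_id for p in kept])
```

In this toy setup, a pair is discarded either because its crop resembles a different class or because its crop carries no usable signal, which loosely mirrors the annotation-error and sensor-limitation cases named above.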
We validate the effectiveness of OV-SCAN through comprehensive experiments on autonomous driving benchmarks, where our framework consistently outperforms existing methods in the OV-3D object detection task. Overall, OV-SCAN underscores the critical role of semantic consistency in cross-modal alignment and demonstrates its potential as a scalable solution for discovering and localizing novel objects in real-world autonomous driving scenarios.
Keywords
Open-Vocabulary, Computer Vision, 3D Object Detection, Vision Language Model, Autonomous Driving