Multimodal 3D Scene Perception
AHMAD, JAVED
2024-03-29
Abstract
This thesis presents single- and multi-modal approaches to 3D object localization and detection, collectively referred to as 3D scene perception, for a wide range of applications such as augmented reality, visualizing places and stories, safer traffic environments, autonomous driving, and robotics. We first address 3D object localization from multiple views (images only), a fundamental problem in augmented reality, city planning, and urban object management: our proposed approach accurately localizes street-level objects in urban areas by combining information from crowd-sourced images with structure-from-motion (SfM) sparse point clouds. Compared with existing approaches that rely only on the GPS coordinates attached to the images, we show that exploiting SfM points increases the accuracy of 3D object localization. Broadly, this part of the work is 3D scene perception from multiple camera images. We then expand the scope of our research to more challenging applications such as autonomous driving and robotics, where we address multimodal (LiDAR-camera) fusion for 3D object detection. We develop a novel fusion scheme that first transforms modality-specific information into an intermediate-level feature representation and then fuses these representations through a carefully designed deep-learning-based fusion module. Unlike existing early- and late-fusion schemes, which struggle to align diverse modalities and suffer from semantic ambiguity, our intermediate-level transformation and fusion aligns the modalities better and preserves both the geometric and the semantic information about 3D objects. Broadly, this part is 3D scene perception from multiple modalities. Overall, our research investigates how feature representations derived from multiple sensors, whether multiple cameras or LiDAR-camera setups, can be transformed into meaningful representations and fused for 3D object localization and detection. The transformation and fusion processes build on traditional techniques alongside more advanced neural networks, with thorough analysis supporting our design choices. We carry out extensive experiments, including in-depth ablation studies, on widely accessible large-scale datasets such as Mapillary, KITTI, and nuScenes. Our results demonstrate the effectiveness of our feature-extraction, transformation, and final 3D localization and detection modules, showcasing their ability to tackle the complexity of diverse scenes. Alongside the quantitative evaluation, we present visualizations of the scenes at both the final and intermediate stages. Finally, we also developed a multi-sensor (LiDAR and cameras) calibration and data-annotation mapping toolkit required for the multimodal 3D scene perception tasks addressed in this thesis.
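The general recipe of intermediate-level LiDAR-camera fusion summarized above can be illustrated with a small sketch. The snippet below is not the thesis implementation; it is a minimal PyTorch example, under assumed tensor shapes and toy calibration values, of the common pattern the abstract describes: project LiDAR points into the image plane with the extrinsic and intrinsic calibration, sample per-point image features at the projected pixels, and fuse the two feature streams with a small learned module. All names (e.g. project_lidar_to_image, IntermediateFusion) are hypothetical.

```python
# Minimal sketch of intermediate-level LiDAR-camera feature fusion (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


def project_lidar_to_image(points_xyz, T_cam_from_lidar, K):
    """Project Nx3 LiDAR points into pixel coordinates using a 4x4 extrinsic and 3x3 intrinsic."""
    n = points_xyz.shape[0]
    homo = torch.cat([points_xyz, torch.ones(n, 1)], dim=1)        # N x 4 homogeneous points
    cam = (T_cam_from_lidar @ homo.T).T[:, :3]                     # N x 3 in the camera frame
    in_front = cam[:, 2] > 1e-3                                    # keep points in front of the camera
    pix = (K @ cam.T).T                                            # N x 3
    pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-3)                 # perspective division -> N x 2 (u, v)
    return pix, in_front


class IntermediateFusion(nn.Module):
    """Fuse per-point LiDAR features with image features sampled at the projected pixels."""

    def __init__(self, lidar_dim=64, image_dim=64, fused_dim=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(lidar_dim + image_dim, fused_dim),
            nn.ReLU(inplace=True),
            nn.Linear(fused_dim, fused_dim),
        )

    def forward(self, lidar_feats, image_feats, pix, image_hw):
        # image_feats: 1 x C x H x W feature map; pix: N x 2 pixel coordinates in the full image.
        h, w = image_hw
        # Normalize pixel coordinates to [-1, 1] as required by grid_sample.
        grid = torch.stack([pix[:, 0] / (w - 1) * 2 - 1,
                            pix[:, 1] / (h - 1) * 2 - 1], dim=1).view(1, 1, -1, 2)
        sampled = F.grid_sample(image_feats, grid, align_corners=True)   # 1 x C x 1 x N
        sampled = sampled.squeeze(0).squeeze(1).T                        # N x C
        return self.fuse(torch.cat([lidar_feats, sampled], dim=1))       # N x fused_dim


if __name__ == "__main__":
    # Toy example with random features and an identity extrinsic (assumed values, not real calibration).
    points = torch.rand(100, 3) * torch.tensor([4.0, 3.0, 8.0]) + torch.tensor([-2.0, -1.5, 2.0])
    T = torch.eye(4)
    K = torch.tensor([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
    pix, valid = project_lidar_to_image(points, T, K)
    lidar_feats = torch.rand(100, 64)
    image_feats = torch.rand(1, 64, 480, 640)
    fused = IntermediateFusion()(lidar_feats[valid], image_feats, pix[valid], (480, 640))
    print(fused.shape)                                                   # torch.Size([n_valid, 128])
```

The same projection step is also the core operation that a LiDAR-camera calibration and annotation-mapping toolkit, such as the one mentioned above, relies on to relate 3D annotations to image pixels.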
File: phdunige_4971429.pdf
Access: open access
Description: The attached file is the complete thesis.
Type: Doctoral thesis
Size: 51.02 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.