
Spatial Reasoning for 3D Shape Understanding

ZOHAIB, MOHAMMAD
2023-03-27

Abstract

In this thesis, we studied deep learning based approaches for estimating different 3D properties of an object. We proposed methods that use either a single image or a single point cloud to reason about an object's geometry. We started from the recent problem of 3D shape reconstruction from a single-view RGB image. We observed that some existing methods work only on synthetic images and fail when applied to real images (with background), while other approaches can extract 3D shapes from real images but their estimates are not smooth, sharp and complete. Identifying the background as a major limitation of the existing methods, we proposed two solutions. The first (baseline) solution enables synthetic-only methods to be run on real datasets. It consists of two modules: a segmenter and a reconstructor. The segmenter takes a real image, segments the object of interest, and pastes the segmented object at the center of a white image. The processed image, which now resembles a synthetic image, is passed to the reconstructor, which estimates the object's 3D shape. We found that this solution improves the performance of the existing synthetic approaches on real images. However, since the baseline relies on the segmenter, it cannot be considered optimal: the reconstruction accuracy depends entirely on the segmenter's output, and if the object is not segmented accurately, the reconstructor cannot recover an accurate 3D shape. To solve this problem, we present a second solution that removes the need for a segmenter. Instead of segmenting the object from the image, it isolates the features of the object of interest by filtering out the features of the background. The object's features are then used to reconstruct the object's 3D shape. The reconstructed shapes are compared with those of the State-Of-The-Art (SOTA) approaches, and the proposed approach outperforms them by estimating more accurate, smooth, sharp and complete 3D shapes.

The two proposed reconstruction solutions always produce 3D shapes in the canonical pose. However, many applications, such as object-grasping manipulators, also require pose information. Since an object's pose can be estimated from keypoints, we investigated keypoint estimation from images in a supervised setting and from point clouds in a self-supervised setting. Our first keypoint estimation approach takes a single-view RGB image as input, extracts pixel-wise features, and uses them to estimate keypoints in 3D space. The network is trained in a fully supervised way using ground truth human-annotated keypoints. In addition, the approach estimates a confidence score for every keypoint, representing its validity; based on these scores and the object's geometry, the network separates the valid keypoints from the N estimated ones. The valid keypoints are used to estimate the relative pose between different views of an object, and the angular distance error of the proposed approach is lower than that of the SOTA approaches. Since this first approach uses only RGB images to estimate 3D keypoints, without any 3D or depth information as input, the keypoints are not always predicted accurately.
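For illustration, the relative pose between two views can be recovered from corresponding 3D keypoints with a standard rigid alignment. The sketch below uses the classical Kabsch (SVD-based) solution in NumPy; the function name, the array shapes and the restriction to keypoints marked valid in both views are assumptions of this sketch, and the thesis may compute the pose differently.

```python
import numpy as np

def relative_rotation_from_keypoints(kp_a: np.ndarray, kp_b: np.ndarray) -> np.ndarray:
    """Estimate the rotation aligning the keypoints of view A to view B (Kabsch).

    kp_a, kp_b: (K, 3) arrays of corresponding 3D keypoints predicted for two
    views of the same object (only keypoints valid in both views).
    Returns a 3x3 rotation matrix R such that R @ kp_a[i] ~= kp_b[i] (up to translation).
    """
    a = kp_a - kp_a.mean(axis=0)                 # center both keypoint sets
    b = kp_b - kp_b.mean(axis=0)
    h = a.T @ b                                  # 3x3 cross-covariance matrix
    u, _, vt = np.linalg.svd(h)
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0   # avoid reflections
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T
```

The angular distance error between an estimated and a ground truth rotation can then be computed in the standard way as arccos((trace(R_est^T R_gt) - 1) / 2).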
To address this limitation, our second approach is a teacher-student architecture that estimates keypoints from a single-view RGB image. The network is trained in two steps: first, the teacher module is trained to extract 3D features from point clouds; second, the teacher teaches the student module to produce, from RGB images, 3D features similar to those obtained from point clouds. During inference, the network uses only the student module, extracting 2D and 3D features directly from an RGB image to estimate keypoints in 3D space. The keypoints are compared with those of existing approaches, including our first keypoint estimation approach, and the results show that the keypoints estimated by the proposed approach yield a more accurate relative pose between different views of an object.

Both of the above keypoint estimation solutions are fully supervised and require a large dataset with ground truth human-annotated keypoints. This limits their reusability, since very few datasets contain accurate keypoint annotations. Therefore, as a third approach, we present a method that estimates keypoints in a self-supervised manner, without any ground truth information. Although estimating keypoints similar to human-annotated ones without supervision is challenging, the proposed approach estimates the keypoints that best characterize the object's shape. We achieve this through a combination of loss components that pull the estimated keypoints towards the object's surface and prevent them from drifting away from the object. The approach is tested on rotated, noisy and decimated point clouds, and it outperforms the SOTA un-/self-supervised approaches.

Beyond the contributions and comparisons with competing approaches, the thesis also discusses the limitations, possible extensions and real-world applications of the proposed methods.
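To make the two-step training of the teacher-student architecture more concrete, the minimal sketch below shows one possible distillation objective for the second step, in which the student's image-based features are driven towards the frozen teacher's point-cloud features. The specific combination of an element-wise and a cosine term, as well as the function name and feature shapes, are assumptions of this sketch rather than the thesis's actual loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
    """Toy distillation objective for the second training step.

    student_feat: (B, C) 3D features predicted by the student from RGB images
    teacher_feat: (B, C) 3D features produced by the pre-trained teacher from
                  the corresponding point clouds
    """
    teacher_feat = teacher_feat.detach()              # the teacher is frozen in this step
    mse = F.mse_loss(student_feat, teacher_feat)      # match features element-wise
    cos = 1.0 - F.cosine_similarity(student_feat, teacher_feat, dim=1).mean()  # match direction
    return mse + cos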
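Similarly, the self-supervised keypoint estimation relies on loss components that pull the keypoints towards the object's surface and keep them from drifting away. The following is a minimal sketch of how such a term could look for point-cloud input; the exact formulation in the thesis may differ, and the added coverage term is purely an assumption of this sketch.

```python
import torch

def keypoint_surface_loss(keypoints: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """Toy surface-attraction loss for self-supervised keypoint estimation.

    keypoints: (B, K, 3) estimated keypoints
    points:    (B, N, 3) input point cloud sampled from the object's surface
    """
    dists = torch.cdist(keypoints, points) ** 2       # pairwise squared distances, (B, K, N)

    # Pull every keypoint towards its nearest surface point; this term is small
    # only when the keypoints lie on (or very close to) the object.
    kp_to_surface = dists.min(dim=2).values.mean()

    # Coverage term (an assumption of this sketch, not taken from the abstract):
    # every surface point should have some keypoint nearby, which discourages
    # the keypoints from collapsing onto a single region of the shape.
    surface_to_kp = dists.min(dim=1).values.mean()

    return kp_to_surface + surface_to_kp
```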
Keywords: 3D reconstruction; 3D keypoint estimation; pose estimation; geometry understanding; object detection
Files in this record:
phdunige_4049960.pdf (Complete PhD Thesis, Adobe PDF, 28.22 MB); Open Access from 28/03/2024
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11567/1109711