Visual Affordance Prediction of Hand-Occluded Objects
APICELLA, TOMMASO
2024-03-27
Abstract
The prediction of affordances, i.e., the potential actions an agent can perform on objects in the scene, is fundamental for human-robot collaboration and wearable robotics scenarios in which objects may be on a tabletop or held by a person. Perceiving affordances from an image is challenging due to the variety of object geometric and physical properties, as well as occlusions caused by clutter or by a person's hand holding the object. In this thesis, we propose a framework for visual affordance prediction that estimates object properties such as position and mass, and identifies graspable regions of objects, supporting the agent in performing the intended actions. Previous methods focused on predicting the filling mass of a container manipulated by a human, whereas the complementary estimation of the container mass, regardless of its content, was underexplored. Moreover, during a human manipulation more than one object may be in the scene, so a selection phase is necessary to focus only on the object of interest. We propose a strategy to select the object manipulated by a human from a fixed frontal RGB-D camera, and we design a model to predict its mass. The model learns to combine color and geometric information to predict the (empty) container mass. Integrating our pipeline with existing filling mass predictors allows us to obtain the complete container mass (object plus content). Object detection methods identify objects in a scene; however, in wearable robotic applications, the human already knows the objects' location and category. We investigate a transfer learning procedure to locate objects in the scene regardless of their category ("objectness"). We target lightweight object detection models that could be used in a wearable application, where the trade-off between accuracy and computational cost is relevant and was previously not investigated. In the case of human manipulation, identifying the object regions an agent can interact with is more challenging due to occlusions and the poses the object may take. We design an affordance segmentation model that learns affordance features under hand occlusion by weighting the feature map through arm and object segmentation. Due to the lack of datasets for this scenario, we complement an existing dataset by annotating the visual affordances of mixed-reality images of hand-held containers in third-person view. Experiments show that the strategy to select objects and predict their mass outperforms most baselines on previously unseen manipulated containers; the transfer learning procedure improves the performance of lightweight object detection methods in a wearable application; and the affordance segmentation model achieves better affordance segmentation and generalisation than existing models.
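The abstract states that the segmentation model weights the feature map through arm and object segmentation. As a rough illustration of that idea only, the sketch below shows one way such mask-based re-weighting could look in PyTorch; the module name, channel sizes, number of affordance classes, and the weighting rule are assumptions for illustration and do not reproduce the architecture described in the thesis.

```python
import torch
import torch.nn as nn


class MaskWeightedAffordanceHead(nn.Module):
    """Hypothetical head: re-weights backbone features with predicted
    arm/object masks before per-pixel affordance classification.
    All names and sizes are illustrative, not the thesis architecture."""

    def __init__(self, in_channels=256, num_affordances=3):
        super().__init__()
        # Auxiliary branch predicting two masks: channel 0 = arm, channel 1 = object.
        self.mask_branch = nn.Conv2d(in_channels, 2, kernel_size=1)
        # Affordance classifier (e.g. background, graspable, contain).
        self.affordance_branch = nn.Conv2d(in_channels, num_affordances, kernel_size=1)

    def forward(self, features):
        # Soft arm/object masks in [0, 1].
        masks = torch.sigmoid(self.mask_branch(features))
        # Emphasise object regions and suppress the occluding arm.
        weight = 1.0 + masks[:, 1:2] - masks[:, 0:1]
        weighted = features * weight
        return self.affordance_branch(weighted), masks
```

Down-weighting pixels predicted as arm and up-weighting those predicted as object is one simple way to make an affordance head less sensitive to hand occlusion, which is the intuition the abstract points to.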
File | Access | Type | Size | Format
---|---|---|---|---
phdunige_4111548.pdf | open access | Doctoral thesis | 21.18 MB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.