
Towards Multimodal Cognitive Architecture for Human-Robot Shared Perception

ELDARDEER, OMAR KHALED ELSAYED MOHAMED
2023-05-08

Abstract

For many years, robots have been used for specific repetitive tasks, especially in industrial contexts. In recent years, however, robots have started to be deployed in interactive and collaborative settings with humans, where their cognitive capabilities remain one of the main open challenges for effective interaction. Shared perception is one of the key skills required for effective collaboration. In robotics, shared perception has mainly been studied from the human perspective (how to enable shared perception when interacting with a robot). In cognitive architecture research, shared perception itself has not been studied, apart from some of the individual skills that enable it (perspective taking, gaze understanding, and gaze following).

My research therefore first identifies five general skills required for shared perception in robotics: having a common representation, expressing effective communication, spatiotemporal coordination, an affective modulation mechanism, and understanding the other. Shared perception is a complex skill, and covering all of these concepts goes beyond the scope of a single Ph.D. The main research activity was therefore the construction of cognitive architectures that address different concepts within the first three skills, with the aim of taking robots one step closer to a shared-perception cognitive architecture. The architectures were built sequentially, each one extending the previous. The approach has four characteristics: biological inspiration; multi-modality, specifically audio and vision; generalization (not targeting a specific task); and an attention-based design (starting from state-of-the-art attention models and building upwards to include higher cognitive capabilities).

Accordingly, the Ph.D. addresses three main research questions:
1. How can we integrate state-of-the-art vision and audio models to allow the robot to jointly attend to the environment with a human partner? Is the robot's behavior effectively perceived by the human partner? What is the mutual influence between the robot and the human partner during an interaction mediated by this architecture?
2. How can this integrated audio-visual attention architecture be used by the robot to understand a complex audio-visual environment? How can uncertainty be handled? How can the robot actively perceive the environment?
3. Can this perception architecture be generalized to different robots and applied to a complex task that requires coordination with another agent?

Each of these questions relates to one or more of the first three required skills listed above. To address them, I designed a series of architectures, implemented cumulatively, and showed how different cognitive blocks can be integrated to improve the robot's perception capabilities and its handling of uncertainty and noise in general conditions. The architectures were validated on multiple robotic platforms (iCub, Pepper, and the Essex agricultural robot) under different conditions: with the robot alone, with a human partner, with another robotic agent, and in a real-world application.

In addition, I carried out research aimed at improving the auditory modality, since the experiments showed that audio processing was the main bottleneck in the system. The first improvement was a developmental pipeline for audio, which I used to create an alternative model for audio localization that achieved very promising results. The second was the exploration of alternative learning processes, other than deep learning models, that are more lightweight and better suited to robotic applications.

Although the proposed architectures do not address all of the skills required for shared perception, they make progress in that direction and tackle several open challenges in the field of cognitive architectures, in particular mutual influence in human-robot interaction scenarios, cross-modal interaction, and general unified perception modeling. The final integrated architecture is a solid base for the development of a shared-perception cognitive architecture for robots. Finally, this work opens multiple future research lines, which can be grouped into three main categories: integrating other cognitive components, improving the individual modalities, and applications and modeling of impairments.
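To give a concrete intuition of the kind of mechanism the integrated audio-visual attention architecture builds on, the following is a minimal, hypothetical sketch (not taken from the thesis): visual and auditory saliency maps are normalized, fused with fixed weights, and the most salient location is selected as the next attention or gaze target. The map sizes, fusion weights, and argmax selection rule are illustrative assumptions only.

```python
# Minimal, hypothetical sketch of audio-visual saliency fusion for attention.
# NOT the thesis implementation; weights and selection rule are assumptions.
import numpy as np

def normalize(saliency_map):
    """Scale a saliency map to [0, 1] so the two modalities are comparable."""
    lo, hi = saliency_map.min(), saliency_map.max()
    return np.zeros_like(saliency_map) if hi == lo else (saliency_map - lo) / (hi - lo)

def fuse_attention(visual_map, audio_map, w_visual=0.6, w_audio=0.4):
    """Weighted fusion of visual and auditory saliency into one egocentric map."""
    return w_visual * normalize(visual_map) + w_audio * normalize(audio_map)

def select_gaze_target(fused_map):
    """Pick the most salient location as the next gaze / attention target."""
    return np.unravel_index(np.argmax(fused_map), fused_map.shape)

# Example: a visually salient point and a localized sound source at different spots.
visual = np.zeros((60, 80)); visual[20, 30] = 1.0
audio  = np.zeros((60, 80)); audio[40, 65]  = 1.0
print("attend to (row, col):", select_gaze_target(fuse_attention(visual, audio)))
```

In a full architecture the fusion weights would typically be modulated dynamically (for example by reliability or task context) rather than fixed as in this toy example.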
Keywords: Cognitive Architectures, Shared Perception, Multisensory Integration, Cross-modal Interaction, Biological Inspiration


Use this identifier to cite or link to this document: https://hdl.handle.net/11567/1117760