Deep learning in data scarcity scenarios
DAVILA CARRAZCO, JULIO IVAN
2024-03-29
Abstract
The availability of data is the fundamental pillar on which deep learning models rely, and the performance of such models depends heavily on the quality of the data used to train them. In an ideal scenario, the data would consist of input-output pairs capturing every aspect of the task the model must solve. Without such data, these models perform less effectively, or cannot be trained at all. In practice, however, acquiring data is often a laborious and expensive undertaking, fraught with challenges and offering no guarantee of quality. Research has therefore been directed toward training deep learning models despite this data scarcity.

In response to this challenge, researchers have developed techniques for training deep learning models robustly across diverse domains. Some of these techniques endow models with the ability to generalize what they learn from one domain to another, improving their performance on diverse data distributions at inference time. Others rely on data manipulation, introducing diversity into the training data to enhance the model's robustness. A further set of approaches exploits representation learning, enabling models to acquire meaningful features autonomously. Many of these methods, however, address scenarios without a specific target domain: the models are designed to generalize across all possible distributions. Although generalizing to many domains may sound like the perfect goal, in practice such domains remain highly similar to the source domains. Consequently, if a target domain has characteristics so distinctive that it barely resembles the source, domain generalization may not perform as intended. This raises a new question: what can be done to train a model for a specific target?

When training for a specific target domain, various methodologies address this scenario, each exploiting whatever information is available. In some cases, the information from the target domain is limited to class descriptions shared by the source and target domains. Approaches designed for this setting, such as Zero-Shot Learning, use these descriptions to establish relationships between the two domains. Other methodologies, such as Domain Adaptation, map the source and target distributions into a shared feature space in which the confusion between domains is maximized; this approach generally requires sufficient data samples from both domains. In the unsupervised setting (UDA), some approaches exploit adversarial training to maximize the confusion between domains, since this procedure does not require labeled samples. In the extreme case where only one target sample is available (One-Shot UDA), the usual strategy is to leverage data augmentation to generate more samples, although this scenario has not yet been thoroughly researched.

This thesis focuses on answering the question above. We do so by presenting methodologies tailored to specific facets of the problem. One such facet is the absence of input-output pairs from the target domain for training, as in Zero-Shot Learning; for this scenario, we define a data augmentation approach that reduces bias toward the source domain. We also explore the extreme case in which only one unannotated input (i.e., an input without an output) is available, as in One-Shot UDA; here, we leverage data augmentation and style transfer to generate samples of the target domain. Across a diverse set of experiments, these novel methodologies demonstrate their effectiveness in tackling complex scenarios characterized by data scarcity.
| File | Type | Access | Size | Format |
|---|---|---|---|---|
| phdunige_4968537.pdf | Doctoral thesis | Open access | 32.24 MB | Adobe PDF |