Discrimination-aware data transformations

IRIS

ACCINELLI, CHIARA

2023-05-25

Abstract

Extensive use of people-related data in automated decision processes might amplify inequities already implicit in real-world data. The development of technological solutions satisfying nondiscrimination requirements is therefore one of the main challenges for the data management and data analytics communities. Nondiscrimination can be characterized in terms of different properties, such as fairness, diversity, and coverage. Such properties should be achieved through a holistic approach, incrementally enforcing nondiscrimination constraints along all the stages of the data processing life-cycle, through individually independent choices rather than as a constraint on the final result. In this respect, the design of discrimination-aware solutions for the initial phases of the data processing pipeline (like data preparation) is extremely relevant: the sooner a problem is spotted, the fewer problems arise in the final analytical steps of the chain. In this PhD thesis, we are interested in nondiscrimination constraints defined in terms of coverage. Coverage aims at guaranteeing that the input dataset includes enough examples for each (protected) category of interest, thus increasing diversity and limiting the introduction of bias during the subsequent analytical steps. While coverage constraints have mainly been used for repairing raw datasets, we investigate their effects on data transformations, during data preparation, through query execution. To this aim, we propose coverage-based queries as a means to satisfy coverage constraints on the result of data transformations defined in terms of selection-based queries, together with specific algorithms for their processing. The proposed solutions rely on query rewriting, a key approach for enforcing specific constraints while guaranteeing transparency and avoiding disparate treatment discrimination.
As far as we know, and according to recent surveys in this domain, no other solutions addressing coverage-based rewriting during data transformations have been proposed so far. To guarantee a good compromise between efficiency and accuracy, both precise and approximate algorithms for coverage-based query processing are proposed. The results of an extensive experimental evaluation, carried out on both synthetic and real datasets, show the effectiveness and efficiency of the proposed approaches. Coverage-based queries can be easily integrated in relational machine learning data processing environments; to show their applicability, we integrate some of the designed algorithms in a machine learning data processing Python toolkit.
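
To make the idea of a coverage constraint on a selection query more concrete, the following is a minimal Python sketch. It is not the algorithms proposed in the thesis: the toy rows, the attribute names (`gender`, `score`), the function names, and the naive strategy of lowering a single threshold step by step are all assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical toy dataset: a protected attribute and a numeric score.
ROWS = [
    {"id": 1, "gender": "F", "score": 55},
    {"id": 2, "gender": "F", "score": 62},
    {"id": 3, "gender": "F", "score": 71},
    {"id": 4, "gender": "M", "score": 65},
    {"id": 5, "gender": "M", "score": 80},
    {"id": 6, "gender": "M", "score": 85},
    {"id": 7, "gender": "M", "score": 90},
]


def select(rows, min_score):
    """Selection-based query: keep the rows with score >= min_score."""
    return [r for r in rows if r["score"] >= min_score]


def satisfies_coverage(rows, attribute, constraints):
    """Return True if each category has at least the required number of rows."""
    counts = Counter(r[attribute] for r in rows)
    return all(counts[cat] >= k for cat, k in constraints.items())


def relax_until_covered(rows, min_score, attribute, constraints, step=5, floor=0):
    """Naively rewrite the query by lowering the threshold in fixed steps.

    Returns the first threshold whose result satisfies the coverage
    constraints, or None if no threshold down to `floor` does.
    """
    threshold = min_score
    while threshold >= floor:
        if satisfies_coverage(select(rows, threshold), attribute, constraints):
            return threshold
        threshold -= step
    return None


if __name__ == "__main__":
    constraints = {"F": 2, "M": 2}  # at least two rows per category
    new_threshold = relax_until_covered(ROWS, 75, "gender", constraints)
    print("rewritten threshold:", new_threshold)
    print("result size:", len(select(ROWS, new_threshold)))
```

In this sketch the rewritten query stays a plain selection on `score`, so the protected attribute is used only to check the constraint and never appears in the selection condition itself. A fixed-step search like this one is only the simplest possible strategy; it makes no claim about minimality of the rewriting or about efficiency.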

Thesis defense date: 25 May 2023

Keywords: coverage; nondiscrimination; data transformations; rewriting; fairness

Appears in types: Doctoral thesis

Files in this record:

File	Size	Format
phdunige_3932140.pdf (open access; type: doctoral thesis)	28.74 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11567/1118607
