Discrimination-aware data transformations
ACCINELLI, CHIARA
2023-05-25
Abstract
The extensive use of people-related data in automated decision processes may amplify inequities already implicit in real-world data. The development of technological solutions that satisfy nondiscrimination requirements is therefore one of the main challenges for the data management and data analytics communities. Nondiscrimination can be characterized in terms of different properties, such as fairness, diversity, and coverage. Such properties should be achieved through a holistic approach that incrementally enforces nondiscrimination constraints along all stages of the data processing life cycle, through individually independent choices rather than as a constraint on the final result. In this respect, the design of discrimination-aware solutions for the initial phases of the data processing pipeline (such as data preparation) is extremely relevant: the sooner a problem is spotted, the fewer problems arise in the final analytical steps of the chain. In this PhD thesis, we are interested in nondiscrimination constraints defined in terms of coverage. Coverage aims at guaranteeing that the input dataset includes enough examples for each (protected) category of interest, thus increasing diversity and limiting the introduction of bias during subsequent analytical steps. While coverage constraints have mainly been used for repairing raw datasets, we investigate their effects on data transformations performed during data preparation through query execution. To this aim, we propose coverage-based queries as a means to satisfy coverage constraints on the result of data transformations defined in terms of selection-based queries, together with specific algorithms for their processing. The proposed solutions rely on query rewriting, a key approach for enforcing specific constraints while guaranteeing transparency and avoiding disparate treatment discrimination.
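To make the idea of a coverage constraint on a selection query concrete, the following is a minimal sketch and not the algorithms proposed in the thesis. It assumes a single selection condition of the form `attr >= threshold` and a constraint requiring at least `k` result rows per protected value; the function names, the toy data, and the brute-force relaxation strategy are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def select(rows, attr, threshold):
    """Selection query: keep rows whose `attr` value is >= threshold."""
    return [r for r in rows if r[attr] >= threshold]

def satisfies_coverage(result, protected_attr, constraints):
    """True if every protected value occurs at least the required number of times."""
    counts = Counter(r[protected_attr] for r in result)
    return all(counts[value] >= k for value, k in constraints.items())

def rewrite_threshold(rows, attr, threshold, protected_attr, constraints):
    """Hypothetical brute-force rewriting: return the largest threshold not
    exceeding the original one whose result satisfies the coverage
    constraints, or None if no relaxation of the threshold does."""
    lower_values = sorted({r[attr] for r in rows if r[attr] < threshold}, reverse=True)
    for t in [threshold] + lower_values:
        if satisfies_coverage(select(rows, attr, t), protected_attr, constraints):
            return t
    return None

# Toy data (invented for this example).
rows = [
    {"name": "a", "gender": "M", "score": 90},
    {"name": "b", "gender": "M", "score": 85},
    {"name": "c", "gender": "F", "score": 82},
    {"name": "d", "gender": "F", "score": 70},
    {"name": "e", "gender": "M", "score": 65},
    {"name": "f", "gender": "F", "score": 60},
]

# Original query: score >= 80; constraint: at least 2 rows with gender "F".
original_ok = satisfies_coverage(select(rows, "score", 80), "gender", {"F": 2})
new_threshold = rewrite_threshold(rows, "score", 80, "gender", {"F": 2})
```

In this sketch the rewritten query still applies one and the same condition to every row, which is how a rewriting-based approach can stay transparent and avoid treating protected groups differently; the thesis's own precise and approximate algorithms are not reproduced here.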
As far as we know, and according to recent surveys in this domain, no other solutions addressing coverage-based rewriting during data transformations have been proposed so far. To guarantee a good compromise between efficiency and accuracy, both precise and approximate algorithms for coverage-based query processing are proposed. The results of an extensive experimental evaluation, carried out on both synthetic and real datasets, show the effectiveness and efficiency of the proposed approaches. Coverage-based queries can be easily integrated into relational machine learning data processing environments; to show their applicability, we integrate some of the designed algorithms into a machine learning data processing Python toolkit.

File | Access | Type | Size | Format
---|---|---|---|---
phdunige_3932140.pdf | open access | Doctoral thesis | 28.74 MB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.