Due to the impact of analytical processes on our life, an increasing effort is being devoted to the design of technological solutions that help humans in measuring the bias introduced by such processes and understanding its causes. Existing solutions can refer to either back-end or front-end stages of the data processing pipeline and usually represent bias in terms of some given diversity or fairness constraint. In our previous work [1], we proposed an approach for rewriting filtering and merge operations in pre-processing pipelines into the “closest” operations so that protected groups are adequately represented (i.e., covered) in the result. This is relevant because any under-represented category in an initial or intermediate dataset might lead to an under-representation of that category in any subsequent analytical process. Since many potential rewritings exist, the proposed approach is approximate and relies on a sample-based cardinality estimation, thus introducing a trade-off between the accuracy and the efficiency of the process. In this paper, we investigate this trade-off by first presenting various measures quantifying the error introduced by the rewriting, due to the applied approximation and the selected sample. Then, we (preliminarly) experimentally evaluate such measures on a real-world dataset.
The impact of rewriting on coverage constraint satisfaction
Accinelli C.;Catania B.;Guerrini G.;Minisi S.
2021-01-01
Abstract
Due to the impact of analytical processes on our life, an increasing effort is being devoted to the design of technological solutions that help humans in measuring the bias introduced by such processes and understanding its causes. Existing solutions can refer to either back-end or front-end stages of the data processing pipeline and usually represent bias in terms of some given diversity or fairness constraint. In our previous work [1], we proposed an approach for rewriting filtering and merge operations in pre-processing pipelines into the “closest” operations so that protected groups are adequately represented (i.e., covered) in the result. This is relevant because any under-represented category in an initial or intermediate dataset might lead to an under-representation of that category in any subsequent analytical process. Since many potential rewritings exist, the proposed approach is approximate and relies on a sample-based cardinality estimation, thus introducing a trade-off between the accuracy and the efficiency of the process. In this paper, we investigate this trade-off by first presenting various measures quantifying the error introduced by the rewriting, due to the applied approximation and the selected sample. Then, we (preliminarly) experimentally evaluate such measures on a real-world dataset.File | Dimensione | Formato | |
---|---|---|---|
PIE+Q_2.pdf
accesso aperto
Descrizione: Contributo in atti di convegno
Tipologia:
Documento in versione editoriale
Dimensione
1.72 MB
Formato
Adobe PDF
|
1.72 MB | Adobe PDF | Visualizza/Apri |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.